# officeconvert

officeconvert is a multimodule conversion toolkit for turning presentation files into
typed `SlideDeck` artifacts with rendered slide images and notes. The repository is
organized around Protocol Buffer schemas with ConnectRPC code generation for both server
and client compatibility.

## Modules

- `proto/` contains protobuf schemas and RPC definitions.
- `gen/python` and `gen/go` contain generated protocol and Connect code.
- `python/packages/officeconvert` is the core conversion library (PPTX -> PDF -> images + notes).
- `python/packages/server` is the ConnectRPC Python server with SeaweedFS (S3-compatible) orchestration.
- `clients/go` is the first client library with layered orchestration helpers.
- `deploy/` contains production-ish and dev Docker Compose files.

## Supported Document Types

The MVP currently supports **PPTX only** and produces a `SlideDeck` result containing:

- ordered slide image URLs
- plain-text notes per slide

## Quick Commands

Use the root `Makefile`:

- `make buf-lint` to lint protobufs
- `make buf-generate` to regenerate Go and Python types
- `make py-sync` to sync Python workspace dependencies with uv
- `make go-test` to run Go client tests
- `make compose-up` to run server + SeaweedFS
- `make compose-up-dev` to run SeaweedFS only
- `make run-server` to start `uvicorn` on the host, using `.env` (if present) plus defaults

## Development Server Workflow

This is the recommended local workflow for iterating on the Python server and conversion
library while keeping SeaweedFS in Docker.

### 1) Prerequisites

- `buf` on your `PATH`
- `uv` on your `PATH`
- Docker + Docker Compose
- Local tools if running server on host (not in container):
  - LibreOffice (`soffice`)
  - Poppler (`pdftoppm`)

### 2) Generate typed API code

From repo root:

```bash
make buf-lint
make buf-generate
```

### 3) Sync Python workspace dependencies

From repo root:

```bash
make py-sync
```

### 4) Start SeaweedFS dependency stack (dev compose)

From repo root:

```bash
make compose-up-dev
```

SeaweedFS endpoints:

- S3 API: `http://localhost:8333`
- Master API: `http://localhost:9333`
- Filer API: `http://localhost:8888`
- Default S3 creds: `minioadmin` / `minioadmin`

### 5) Start Connect server (host process)

In a separate terminal, from repo root:

```bash
make run-server
```

`make run-server` behavior:

- loads `.env` automatically if present
- applies reasonable defaults when values are not set
- defaults S3 endpoint to `localhost:8333` for host-based development
- auto-normalizes `seaweedfs:8333` to `localhost:8333` for host runs
- supports optional `UVICORN_HOST` and `UVICORN_PORT` overrides
- exposes conversion timeout tuning vars (`CONVERSION_PPTX_TO_PDF_TIMEOUT_SECONDS`, `CONVERSION_PDF_TO_IMAGES_TIMEOUT_SECONDS`)

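
The endpoint defaulting and normalization listed for `make run-server` can be sketched in Python. This is an illustration only: the real Makefile logic is not shown in this README, the function names `resolve_s3_endpoint` and `resolve_port` are hypothetical, reading the endpoint from `S3_ENDPOINT` is an assumption, and the `8080` default is inferred from the documented base URL `http://localhost:8080`.

```python
from typing import Mapping

DOCKER_ENDPOINT = "seaweedfs:8333"  # compose service name, only resolvable inside Docker
HOST_ENDPOINT = "localhost:8333"    # SeaweedFS S3 port as published to the host


def resolve_s3_endpoint(env: Mapping[str, str]) -> str:
    """Return the S3 endpoint a host-run server should use (assumed logic)."""
    endpoint = env.get("S3_ENDPOINT", "").strip() or HOST_ENDPOINT
    # A .env written for the container stack points at the compose service name;
    # rewrite it so a process on the host can still reach SeaweedFS.
    if endpoint == DOCKER_ENDPOINT:
        return HOST_ENDPOINT
    return endpoint


def resolve_port(env: Mapping[str, str], default: int = 8080) -> int:
    """Return the uvicorn port, honoring an optional UVICORN_PORT override."""
    raw = env.get("UVICORN_PORT", "").strip()
    return int(raw) if raw else default
```

With this shape, a `.env` shared between the container and host workflows does not need editing when switching between them.
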
Server endpoint base URL:

- `http://localhost:8080`

### 6) Quick smoke test

Create a conversion request:

```bash
curl \
  --header "Content-Type: application/json" \
  --data '{
    "sourceFilename":"example.pptx",
    "full":{"resolution":"CONVERSION_RESOLUTION_FHD","jpeg":{"quality":85}},
    "thumbnail":{"resolution":"CONVERSION_RESOLUTION_SD","jpeg":{"quality":75}}
  }' \
  http://localhost:8080/officeconvertapi.v1.ConversionService/CreateConversion
```

Then:

1. Upload the PPTX to the returned `uploadUrl` using HTTP `PUT`.
2. Call `StartConversion` with the returned `conversionId`.
3. Poll `GetConversionStatus` until `CONVERSION_STATUS_SUCCEEDED`.
4. Call `GetSlideDeck` and download each `imageUrl`.
5. Optionally call `DeleteConversion` for early cleanup.

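
The steps above can be scripted with the Python standard library. The following is a minimal sketch, not the project's client: the helper names are hypothetical, and the request shape `{"conversionId": ...}` for the follow-up calls and the `status` response field are assumptions that should be checked against the schemas in `proto/`. Only `CONVERSION_STATUS_SUCCEEDED` is handled, so the loop gives up after a fixed number of polls instead of guessing at failure states.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8080"
SERVICE = "officeconvertapi.v1.ConversionService"


def rpc_url(method: str, base_url: str = BASE_URL) -> str:
    """Build the Connect URL for a unary method: <base>/<service>/<method>."""
    return f"{base_url.rstrip('/')}/{SERVICE}/{method}"


def build_rpc_request(method: str, payload: dict) -> urllib.request.Request:
    """Build a unary call as a JSON body sent with HTTP POST."""
    return urllib.request.Request(
        rpc_url(method),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(method: str, payload: dict) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_rpc_request(method, payload)) as resp:
        return json.load(resp)


def convert(pptx_path: str, poll_seconds: float = 2.0, max_polls: int = 300) -> dict:
    """Run create -> upload -> start -> poll -> fetch for one PPTX file."""
    created = call("CreateConversion", {
        "sourceFilename": "example.pptx",
        "full": {"resolution": "CONVERSION_RESOLUTION_FHD", "jpeg": {"quality": 85}},
        "thumbnail": {"resolution": "CONVERSION_RESOLUTION_SD", "jpeg": {"quality": 75}},
    })
    # Step 1: upload the source file to the presigned URL with HTTP PUT.
    with open(pptx_path, "rb") as f:
        upload = urllib.request.Request(created["uploadUrl"], data=f.read(), method="PUT")
    urllib.request.urlopen(upload).close()
    # Steps 2-4: the {"conversionId": ...} body is an assumed request shape.
    ref = {"conversionId": created["conversionId"]}
    call("StartConversion", ref)
    for _ in range(max_polls):
        # "status" is an assumed response field name.
        if call("GetConversionStatus", ref).get("status") == "CONVERSION_STATUS_SUCCEEDED":
            return call("GetSlideDeck", ref)
        time.sleep(poll_seconds)
    raise TimeoutError("conversion did not succeed within the polling budget")
```

Calling `convert("example.pptx")` against a running dev server would return the decoded `GetSlideDeck` response, from which each `imageUrl` can be downloaded.
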
### 7) Full container workflow (optional)

If you want to run both server and SeaweedFS in Docker:

```bash
make compose-up
```

Use `.env.example` as your baseline env configuration.

## Storage Backend Notes

- Local development defaults to **SeaweedFS** (S3-compatible) via Docker Compose. Compose runs an `s3-init` step that creates the dev bucket before the server starts.
- Production can use any S3-compatible provider; **AWS S3** is the expected choice.
- The Python server uses the `minio` Python SDK against the S3 API.
- Runtime configuration uses `S3_*` environment variables.
- All conversions share one bucket (`S3_BUCKET`, required). Each conversion's objects live under a `{conversion_id}/` key prefix (for example `{conversion_id}/input/source.pptx` and `{conversion_id}/output/slide-0001.jpg`).

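
The key layout in the last bullet can be expressed as a few small helpers. This is a sketch with hypothetical function names. The 1-based numbering and four-digit zero padding are inferred from the single example `slide-0001.jpg`, and key names for thumbnails are not shown in this README, so they are left out.

```python
def conversion_prefix(conversion_id: str) -> str:
    """Prefix shared by every object that belongs to one conversion."""
    return f"{conversion_id}/"


def input_key(conversion_id: str) -> str:
    """Key of the uploaded source deck."""
    return f"{conversion_prefix(conversion_id)}input/source.pptx"


def slide_key(conversion_id: str, slide_number: int) -> str:
    """Key of one rendered slide image (assumed 1-based, zero-padded to 4 digits)."""
    if slide_number < 1:
        raise ValueError("slide_number is 1-based")
    return f"{conversion_prefix(conversion_id)}output/slide-{slide_number:04d}.jpg"
```

Because every key starts with `conversion_prefix(...)`, cleaning up a conversion amounts to listing and deleting everything under that one prefix.
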
### AWS setup

**Bucket**

1. Create one bucket (for example `officeconvert-prod`) in the region where the server runs.
2. Leave **Block Public Access** enabled. Presigned URLs work without a public bucket.
3. Optional: add a lifecycle rule to expire objects after a few days as a safety net if cleanup fails.

**Server environment**

Set at minimum:

```bash
S3_BUCKET=officeconvert-prod
S3_ENDPOINT=s3.us-east-1.amazonaws.com
S3_PUBLIC_ENDPOINT=s3.us-east-1.amazonaws.com
S3_REGION=us-east-1
S3_USE_SSL=true
S3_PUBLIC_USE_SSL=true
S3_ACCESS_KEY=...
S3_SECRET_KEY=...
```

Use your bucket's regional hostname for both endpoints unless you deliberately split internal vs client-facing access. `S3_PUBLIC_ENDPOINT` must be reachable by whatever uploads and downloads via presigned URLs (clients, not just the server).

On startup the server verifies the bucket exists via HeadBucket and fails fast if it is missing. **Pre-create the bucket** before deploying (see IAM below).

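
A fail-fast startup check along these lines could look like the sketch below. It is not the server's actual code: `S3Settings`, `load_s3_settings`, and `verify_bucket` are hypothetical names, and the fallback values and the accepted spellings of a true `S3_USE_SSL` are assumptions. The `Minio(...)` constructor and `bucket_exists` come from the `minio` Python SDK, which must be installed separately.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class S3Settings:
    bucket: str
    endpoint: str
    region: Optional[str]
    use_ssl: bool
    access_key: str
    secret_key: str


def load_s3_settings(env: Mapping[str, str]) -> S3Settings:
    """Read the S3_* variables; S3_BUCKET is required, the rest fall back to assumed defaults."""
    bucket = env.get("S3_BUCKET", "").strip()
    if not bucket:
        raise ValueError("S3_BUCKET is required")
    return S3Settings(
        bucket=bucket,
        endpoint=env.get("S3_ENDPOINT", "localhost:8333"),
        region=env.get("S3_REGION") or None,
        use_ssl=env.get("S3_USE_SSL", "false").strip().lower() in {"1", "true", "yes"},
        access_key=env.get("S3_ACCESS_KEY", ""),
        secret_key=env.get("S3_SECRET_KEY", ""),
    )


def verify_bucket(settings: S3Settings) -> None:
    """Raise if the bucket is missing; bucket_exists sends a HeadBucket request."""
    from minio import Minio  # third-party dependency, imported lazily

    client = Minio(
        settings.endpoint,
        access_key=settings.access_key,
        secret_key=settings.secret_key,
        secure=settings.use_ssl,
        region=settings.region,
    )
    if not client.bucket_exists(settings.bucket):
        raise RuntimeError(f"bucket {settings.bucket!r} does not exist")
```

Running `verify_bucket(load_s3_settings(os.environ))` before accepting traffic surfaces a missing bucket or bad credentials at deploy time instead of on the first conversion.
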
**IAM permissions**

Scope access to the single bucket. Object keys are per-conversion prefixes, so list/delete can target the whole bucket. Startup verification uses HeadBucket, which is authorized by `s3:ListBucket` on the bucket ARN; IAM has no separate `s3:HeadBucket` action, so none is listed:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::officeconvert-prod"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::officeconvert-prod/*"
    }
  ]
}
```

**CORS**

Required only if uploads or downloads go **directly from a browser** to presigned URLs. Server-side clients (`curl`, the Go client) do not need CORS. Allow `PUT` and `GET` for your web origin on the bucket.

**IAM roles vs IAM users**

AWS recommends **roles** over long-lived **IAM user** access keys when the server runs on AWS compute (ECS, EC2, Lambda): a role grants **temporary** credentials that rotate automatically, with no static keys to store or leak.

For this project today, the server reads explicit `S3_ACCESS_KEY` and `S3_SECRET_KEY` via the MinIO SDK. That maps cleanly to:

| Where you run | Practical choice |
|---------------|------------------|
| Docker on a VPS, bare metal, or outside AWS | IAM **user** with the policy above; store keys in env or a secrets manager. Fine for a single service at low volume. |
| ECS / EC2 / EKS on AWS | Prefer an IAM **role** attached to the task or instance. Your orchestrator injects short-lived credentials; you still pass them into `S3_ACCESS_KEY` / `S3_SECRET_KEY`, along with a session token if your runtime provides one. Note that the server does not yet read a dedicated `S3_SESSION_TOKEN` env var. |

## Conversion Tuning Notes

If conversion fails on larger decks, tune these request fields and environment variables:

- `CreateConversionRequest.full.resolution` controls full-size output dimensions via presets: `SD`, `HD`, `FHD`, `QHD`, `UHD`.
- `CreateConversionRequest.thumbnail.resolution` controls thumbnail output dimensions with the same presets.
- Omitting full/thumbnail resolution (or sending `CONVERSION_RESOLUTION_UNSPECIFIED`) defaults to `FHD` for full and `SD` for thumbnail.
- Output is JPEG-only for now; set `CreateConversionRequest.full.jpeg.quality` and `CreateConversionRequest.thumbnail.jpeg.quality` to `1..100` (`0` or omitted uses server defaults: full `85`, thumbnail `75`).
- Rasterization DPI is inferred automatically from source slide size and selected full/thumbnail output dimensions.
- `CONVERSION_PPTX_TO_PDF_TIMEOUT_SECONDS` (default `180`): timeout for LibreOffice export.
- `CONVERSION_PDF_TO_IMAGES_TIMEOUT_SECONDS` (default `1800`): timeout for Poppler rasterization.

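
The documented defaults can be summarized in a short sketch. The helper names are hypothetical and this is not the server's implementation; rejecting non-positive timeouts and out-of-range quality values is an assumption about sensible behavior, not something this README states.

```python
from typing import Mapping, Optional

DEFAULT_PPTX_TO_PDF_TIMEOUT = 180     # seconds, LibreOffice export
DEFAULT_PDF_TO_IMAGES_TIMEOUT = 1800  # seconds, Poppler rasterization
DEFAULT_FULL_QUALITY = 85
DEFAULT_THUMBNAIL_QUALITY = 75


def timeout_seconds(env: Mapping[str, str], name: str, default: int) -> int:
    """Read a timeout variable, falling back to the documented default when unset."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{name} must be a positive number of seconds")
    return value


def effective_quality(requested: Optional[int], default: int) -> int:
    """Map an omitted or 0 quality to the server default; otherwise require 1..100."""
    if not requested:  # None (omitted) or 0
        return default
    if not 1 <= requested <= 100:
        raise ValueError("JPEG quality must be in 1..100")
    return requested
```

For a large deck, raising `CONVERSION_PDF_TO_IMAGES_TIMEOUT_SECONDS` or choosing a smaller resolution preset are the two levers this section describes.
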