don't use S3 CreateBucket and clean up
2026-06-17 16:58:02 -07:00


# officeconvert
officeconvert is a multi-module conversion toolkit for turning presentation files into
typed `SlideDeck` artifacts with rendered slide images and notes. The repository is
organized around Protocol Buffer schemas with ConnectRPC code generation for both server
and client compatibility.
## Modules
- `proto/` contains protobuf schemas and RPC definitions.
- `gen/python` and `gen/go` contain generated protocol and Connect code.
- `python/packages/officeconvert` is the core conversion library (PPTX -> PDF -> images + notes).
- `python/packages/server` is the ConnectRPC Python server with SeaweedFS (S3-compatible) orchestration.
- `clients/go` is the first client library with layered orchestration helpers.
- `deploy/` contains production-ish and dev Docker Compose files.
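As a rough illustration of the `PPTX -> PDF -> images` pipeline named above, the two stages map onto the LibreOffice and Poppler command-line tools. The sketch below only builds the argument lists; the function names are hypothetical, and the flags the core library actually passes may differ. Notes extraction is not covered here.

```python
from pathlib import Path


def pptx_to_pdf_cmd(pptx: Path, out_dir: Path) -> list[str]:
    """Build a headless LibreOffice command that exports a PPTX to PDF."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(pptx),
    ]


def pdf_to_jpeg_cmd(pdf: Path, out_prefix: Path, dpi: int, quality: int) -> list[str]:
    """Build a Poppler command that rasterizes every PDF page to a JPEG."""
    return [
        "pdftoppm",
        "-jpeg",
        "-jpegopt", f"quality={quality}",  # needs a Poppler build with -jpegopt
        "-r", str(dpi),
        str(pdf),
        str(out_prefix),  # pdftoppm appends the page number and extension
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True, timeout=...)`, which is where a per-stage timeout would apply.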
## Supported Document Types
The MVP currently supports **PPTX only** and produces a `SlideDeck` result containing:
- ordered slide image URLs
- plain-text notes per slide
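To make the shape of that result concrete, here is a minimal Python stand-in. The real `SlideDeck` type is generated from the schemas in `proto/`; the class and field names below are assumptions for illustration, not the generated API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slide:
    index: int       # 1-based position in the deck (assumed convention)
    image_url: str   # URL of the rendered slide image
    notes: str = ""  # plain-text speaker notes, empty if the slide has none


@dataclass(frozen=True)
class SlideDeck:
    slides: tuple[Slide, ...] = ()

    def ordered(self) -> list[Slide]:
        """Return slides sorted by their position in the deck."""
        return sorted(self.slides, key=lambda slide: slide.index)
```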
## Quick Commands
Use the root `Makefile`:
- `make buf-lint` to lint protobufs
- `make buf-generate` to regenerate Go and Python types
- `make py-sync` to sync Python workspace dependencies with uv
- `make go-test` to run Go client tests
- `make compose-up` to run server + SeaweedFS
- `make compose-up-dev` to run SeaweedFS only
- `make run-server` to start `uvicorn` on the host, loading `.env` (if present) plus defaults
## Development Server Workflow
This is the recommended local workflow for iterating on the Python server and conversion
library while keeping SeaweedFS in Docker.
### 1) Prerequisites
- `buf` on your `PATH`
- `uv` on your `PATH`
- Docker + Docker Compose
- Local tools if running the server on the host (not in a container):
  - LibreOffice (`soffice`)
  - Poppler (`pdftoppm`)
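A quick way to confirm the prerequisites above is to look each executable up on `PATH`. This is a small standalone sketch, not a script shipped in the repository; it checks for `docker` only and assumes Compose is available as a `docker` subcommand.

```python
import shutil

REQUIRED_TOOLS = ("buf", "uv", "docker")
HOST_ONLY_TOOLS = ("soffice", "pdftoppm")  # only needed when the server runs on the host


def missing_tools(names):
    """Return the tools from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]
```

Calling `missing_tools(REQUIRED_TOOLS + HOST_ONLY_TOOLS)` returns an empty list when everything is installed.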
### 2) Generate typed API code
From repo root:
```bash
make buf-lint
make buf-generate
```
### 3) Sync Python workspace dependencies
From repo root:
```bash
make py-sync
```
### 4) Start SeaweedFS dependency stack (dev compose)
From repo root:
```bash
make compose-up-dev
```
SeaweedFS endpoints:
- S3 API: `http://localhost:8333`
- Master API: `http://localhost:9333`
- Filer API: `http://localhost:8888`
- Default S3 creds: `minioadmin` / `minioadmin`
### 5) Start Connect server (host process)
In a separate terminal, from repo root:
```bash
make run-server
```
`make run-server` behavior:
- loads `.env` automatically if present
- applies reasonable defaults when values are not set
- defaults S3 endpoint to `localhost:8333` for host-based development
- auto-normalizes `seaweedfs:8333` to `localhost:8333` for host runs
- supports optional `UVICORN_HOST` and `UVICORN_PORT` overrides
- exposes conversion timeout tuning vars (`CONVERSION_PPTX_TO_PDF_TIMEOUT_SECONDS`, `CONVERSION_PDF_TO_IMAGES_TIMEOUT_SECONDS`)
Server endpoint base URL:
- `http://localhost:8080`
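The S3 endpoint defaulting and normalization described above can be pictured as the following rule. This is a Python rendition for illustration only; the function name is hypothetical and the actual `make run-server` logic may be implemented differently.

```python
from typing import Optional

DEFAULT_ENDPOINT = "localhost:8333"


def normalize_s3_endpoint(value: Optional[str]) -> str:
    """Pick an S3 endpoint that is reachable from a host process."""
    if not value:
        return DEFAULT_ENDPOINT  # nothing configured: use the dev default
    host, sep, port = value.partition(":")
    if host == "seaweedfs":
        # The Compose service name only resolves inside the Docker network.
        return f"localhost{sep}{port}"
    return value
```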
### 6) Quick smoke test
Create a conversion request:
```bash
curl \
  --header "Content-Type: application/json" \
  --data '{
    "sourceFilename": "example.pptx",
    "full": {"resolution": "CONVERSION_RESOLUTION_FHD", "jpeg": {"quality": 85}},
    "thumbnail": {"resolution": "CONVERSION_RESOLUTION_SD", "jpeg": {"quality": 75}}
  }' \
  http://localhost:8080/officeconvertapi.v1.ConversionService/CreateConversion
```
Then:
1. Upload the PPTX to the returned `uploadUrl` using HTTP `PUT`.
2. Call `StartConversion` with the returned `conversionId`.
3. Poll `GetConversionStatus` until `CONVERSION_STATUS_SUCCEEDED`.
4. Call `GetSlideDeck` and download each `imageUrl`.
5. Optionally call `DeleteConversion` for early cleanup.
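The steps above can be sketched end to end with the Python standard library. Several details here are assumptions rather than confirmed API: that each follow-up RPC takes a `{"conversionId": ...}` JSON body, that the status response exposes a `status` field, and that a `CONVERSION_STATUS_FAILED` value exists. Check `proto/` for the real message shapes.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8080/officeconvertapi.v1.ConversionService"


def call_rpc(method: str, payload: dict) -> dict:
    """POST a JSON body to a Connect unary RPC and decode the JSON reply."""
    request = urllib.request.Request(
        f"{BASE_URL}/{method}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def upload_pptx(upload_url: str, pptx_bytes: bytes) -> None:
    """Upload the source file to the presigned URL with HTTP PUT."""
    request = urllib.request.Request(upload_url, data=pptx_bytes, method="PUT")
    with urllib.request.urlopen(request):
        pass


def convert(pptx_bytes, create_payload, rpc=call_rpc, upload=upload_pptx,
            poll_seconds=2.0, max_polls=300):
    """Run create -> upload -> start -> poll -> fetch and return the deck JSON."""
    created = rpc("CreateConversion", create_payload)
    conversion_id = created["conversionId"]
    upload(created["uploadUrl"], pptx_bytes)
    rpc("StartConversion", {"conversionId": conversion_id})  # assumed body shape
    for _ in range(max_polls):
        status = rpc("GetConversionStatus", {"conversionId": conversion_id})
        state = status.get("status")  # assumed field name
        if state == "CONVERSION_STATUS_SUCCEEDED":
            return rpc("GetSlideDeck", {"conversionId": conversion_id})
        if state == "CONVERSION_STATUS_FAILED":  # assumed enum value
            raise RuntimeError(f"conversion {conversion_id} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"conversion {conversion_id} did not finish in time")
```

The `rpc` and `upload` parameters exist so the flow can be exercised against fakes without a running server.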
### 7) Full container workflow (optional)
If you want to run both server and SeaweedFS in Docker:
```bash
make compose-up
```
Use `.env.example` as your baseline env configuration.
## Storage Backend Notes
- Local development defaults to **SeaweedFS** (S3-compatible) via Docker Compose. Compose runs an `s3-init` step that creates the dev bucket before the server starts.
- Production can use any S3-compatible provider; **AWS S3** is the expected choice.
- The Python server uses the `minio` Python SDK against the S3 API.
- Runtime configuration uses `S3_*` environment variables.
- All conversions share one bucket (`S3_BUCKET`, required). Each conversion's objects live under a `{conversion_id}/` key prefix (for example `{conversion_id}/input/source.pptx` and `{conversion_id}/output/slide-0001.jpg`).
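The key layout in the last bullet can be expressed as a few helpers. The function names are hypothetical, and the four-digit zero padding is inferred from the `slide-0001.jpg` example rather than confirmed by the server code.

```python
def conversion_prefix(conversion_id: str) -> str:
    """Prefix under which all objects for one conversion live."""
    return f"{conversion_id}/"


def input_key(conversion_id: str) -> str:
    """Object key for the uploaded source presentation."""
    return f"{conversion_id}/input/source.pptx"


def slide_key(conversion_id: str, slide_number: int) -> str:
    """Object key for a rendered slide image (1-based, zero-padded)."""
    return f"{conversion_id}/output/slide-{slide_number:04d}.jpg"
```

Because every key starts with `conversion_prefix(conversion_id)`, cleanup for one conversion amounts to listing and deleting under that single prefix.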
### AWS setup
**Bucket**
1. Create one bucket (for example `officeconvert-prod`) in the region where the server runs.
2. Leave **Block Public Access** enabled. Presigned URLs work without a public bucket.
3. Optional: add a lifecycle rule to expire objects after a few days as a safety net if cleanup fails.
**Server environment**
Set at minimum:
```bash
S3_BUCKET=officeconvert-prod
S3_ENDPOINT=s3.us-east-1.amazonaws.com
S3_PUBLIC_ENDPOINT=s3.us-east-1.amazonaws.com
S3_REGION=us-east-1
S3_USE_SSL=true
S3_PUBLIC_USE_SSL=true
S3_ACCESS_KEY=...
S3_SECRET_KEY=...
```
Use your bucket's regional hostname for both endpoints unless you deliberately split internal vs client-facing access. `S3_PUBLIC_ENDPOINT` must be reachable by whatever uploads and downloads via presigned URLs (clients, not just the server).
On startup the server verifies the bucket exists via HeadBucket and fails fast if it is missing. **Pre-create the bucket** before deploying (see IAM below).
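A fail-fast startup check along these lines could look like the sketch below. It assumes the variable names shown in the environment block, a default of `true` for `S3_USE_SSL`, and hypothetical helper names; the server's own startup code may differ. `bucket_exists` is the `minio` SDK call that performs the bucket existence check.

```python
import os


def minio_kwargs_from_env(env) -> dict:
    """Translate S3_* settings into keyword arguments for minio.Minio."""
    return {
        "endpoint": env["S3_ENDPOINT"],
        "access_key": env["S3_ACCESS_KEY"],
        "secret_key": env["S3_SECRET_KEY"],
        "region": env.get("S3_REGION"),
        # Assumed default: treat a missing S3_USE_SSL as "true".
        "secure": env.get("S3_USE_SSL", "true").strip().lower() == "true",
    }


def verify_bucket(client, bucket: str) -> None:
    """Raise immediately if the configured bucket does not exist."""
    if not client.bucket_exists(bucket_name=bucket):
        raise RuntimeError(
            f"S3 bucket {bucket!r} does not exist; create it before starting the server"
        )


# Usage sketch (requires the `minio` package and a reachable endpoint):
#   from minio import Minio
#   client = Minio(**minio_kwargs_from_env(os.environ))
#   verify_bucket(client, os.environ["S3_BUCKET"])
```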
**IAM permissions**
Scope access to the single bucket. Object keys are per-conversion prefixes, so list/delete can target the whole bucket. Startup verification uses HeadBucket, which is satisfied by `s3:ListBucket` on the bucket ARN:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::officeconvert-prod"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::officeconvert-prod/*"
    }
  ]
}
```
**CORS**
Required only if uploads or downloads go **directly from a browser** to presigned URLs. Server-side clients (`curl`, the Go client) do not need CORS. Allow `PUT` and `GET` for your web origin on the bucket.
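For the browser case, a CORS configuration in the JSON rule format that S3 accepts might be generated as follows. The origin is a placeholder, and allowing all request headers is an assumption you may want to tighten.

```python
import json


def cors_rules(origin: str) -> list:
    """Build a single CORS rule allowing browser PUT and GET from one origin."""
    return [
        {
            "AllowedOrigins": [origin],
            "AllowedMethods": ["PUT", "GET"],
            "AllowedHeaders": ["*"],  # assumption: permit any request header
        }
    ]


# Placeholder origin; replace with the origin that serves your web app.
print(json.dumps(cors_rules("https://app.example.com"), indent=2))
```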
**IAM roles vs IAM users**
AWS recommends **roles** over long-lived **IAM user** access keys when the server runs on AWS compute (ECS, EC2, Lambda): a role grants **temporary** credentials that rotate automatically, with no static keys to store or leak.
For this project today, the server reads explicit `S3_ACCESS_KEY` and `S3_SECRET_KEY` via the MinIO SDK. That maps cleanly to:
| Where you run | Practical choice |
|---------------|------------------|
| Docker on a VPS, bare metal, or outside AWS | IAM **user** with the policy above; store keys in env or a secrets manager. Fine for a single service at low volume. |
| ECS / EC2 / EKS on AWS | Prefer an IAM **role** attached to the task or instance. Your orchestrator injects short-lived credentials; you still pass them into `S3_ACCESS_KEY` / `S3_SECRET_KEY`, along with a session token if your runtime provides one. Note that the server does not yet read a dedicated `S3_SESSION_TOKEN` env var. |
## Conversion Tuning Notes
If conversion fails on larger decks, tune these request fields and environment variables:
- `CreateConversionRequest.full.resolution` controls full-size output dimensions via presets: `SD`, `HD`, `FHD`, `QHD`, `UHD`.
- `CreateConversionRequest.thumbnail.resolution` controls thumbnail output dimensions with the same presets.
- Omitting full/thumbnail resolution (or sending `CONVERSION_RESOLUTION_UNSPECIFIED`) defaults to `FHD` for full and `SD` for thumbnail.
- Output is JPEG-only for now; set `CreateConversionRequest.full.jpeg.quality` and `CreateConversionRequest.thumbnail.jpeg.quality` to `1..100` (`0` or omitted uses server defaults: full `85`, thumbnail `75`).
- Rasterization DPI is inferred automatically from source slide size and selected full/thumbnail output dimensions.
- `CONVERSION_PPTX_TO_PDF_TIMEOUT_SECONDS` (default `180`): timeout for LibreOffice export.
- `CONVERSION_PDF_TO_IMAGES_TIMEOUT_SECONDS` (default `1800`): timeout for Poppler rasterization.
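The defaulting rules in this list can be summarized in a short sketch. The helper names are hypothetical, and the DPI formula (fit the slide inside the target box) is one plausible reading of "inferred automatically", not the server's confirmed method. The 1920x1080 size used for `FHD` in the comment is likewise an assumption.

```python
PPTX_TO_PDF_TIMEOUT_DEFAULT = 180
PDF_TO_IMAGES_TIMEOUT_DEFAULT = 1800


def timeout_from_env(env, name: str, default: int) -> int:
    """Read a timeout in seconds, falling back to the default when unset."""
    raw = env.get(name, "").strip()
    return int(raw) if raw else default


def effective_quality(requested, default: int) -> int:
    """Apply the JPEG quality rule: 0 or omitted means the server default."""
    if not requested:
        return default
    if not 1 <= requested <= 100:
        raise ValueError("jpeg quality must be in 1..100")
    return requested


def infer_dpi(slide_w_in: float, slide_h_in: float, target_w_px: int, target_h_px: int) -> float:
    """Pick the DPI at which the slide just fits inside the target dimensions."""
    return min(target_w_px / slide_w_in, target_h_px / slide_h_in)


# Example: a 10in x 7.5in slide targeting 1920x1080 (assumed FHD size)
# is limited by height: 1080 / 7.5 = 144 DPI.
```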