Custom Forms and Doc AI on GCP
Overview
This page walks through the end-to-end steps to create, train, deploy, and integrate a custom processor in Google Cloud Document AI that extracts structured data from your forms.
Step-by-step process
1. Collect sample documents
- Gather representative examples: 20–200 documents depending on complexity (minimum ~10).
- Include layout variants, handwritten/typed, different languages, filled & empty forms, edge cases.
2. Create a dataset in Document AI
- Console: Document AI → Datasets → Create Dataset.
- Choose dataset type: Custom Extractor and select a region.
3. Upload documents
- Methods: drag-and-drop in console, link Cloud Storage bucket, or use the API for bulk ingestion.
- Verify file types and page ordering (supported: PDF, TIFF, PNG, JPEG).
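Before a bulk upload, it can help to screen local files against the supported types listed above. A minimal sketch (the file names and the helper are illustrative, not part of any Document AI API):

```python
from pathlib import Path

# File types the custom extractor dataset accepts, per the list above.
SUPPORTED_SUFFIXES = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}

def unsupported_files(paths):
    """Return the subset of paths whose extension won't be ingested."""
    return [p for p in paths if Path(p).suffix.lower() not in SUPPORTED_SUFFIXES]

# Example: flag a stray Word document before uploading a batch.
print(unsupported_files(["claim_001.pdf", "claim_002.docx", "scan_003.tiff"]))
# → ['claim_002.docx']
```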
4. Define entities (schema)
- Decide the fields to extract (e.g., PatientName, PolicyNumber, InvoiceTotal).
- For each entity, set: name, description, data type (text, date, number, enum, boolean).
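Keeping a code-side mirror of the console schema makes downstream validation easier. The sketch below is hypothetical: the field names come from the examples above, but the types and required flags are assumptions you would keep in sync with your actual dataset:

```python
# Hypothetical mirror of the entities defined in the Document AI console.
SCHEMA = {
    "PatientName":  {"type": "text",   "required": True},
    "PolicyNumber": {"type": "text",   "required": True},
    "InvoiceTotal": {"type": "number", "required": False},
}

def missing_required(extracted: dict) -> list:
    """List required schema fields absent from an extraction result."""
    return [name for name, spec in SCHEMA.items()
            if spec["required"] and name not in extracted]

print(missing_required({"PatientName": "Jane Doe"}))
# → ['PolicyNumber']
```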
5. Label documents
- Use Document AI's labeling tool to highlight text, mark checkboxes, annotate tables.
- Label consistently; if a field is absent, mark it as not-present.
- Best practice: label every entity on each page where it occurs.
6. Train the custom extractor
- Start training from the dataset UI: choose train/test split and begin.
- Training yields metrics: precision, recall, F1, and per-field accuracy.
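F1 is the harmonic mean of precision and recall, so it can be reproduced from raw counts of correct, spurious, and missed extractions for any field. A small helper for sanity-checking the reported numbers:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from raw counts for one field."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 90 correct extractions, 10 spurious, 30 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# → precision=0.90 recall=0.75 f1=0.82
```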
7. Validate & iterate
- Inspect errors, add more labels for edge cases, fix inconsistent annotations, and retrain.
- Repeat until per-field accuracy and overall F1 meet your acceptance criteria.
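Making the acceptance criteria explicit in code keeps the iterate-or-ship decision mechanical. The thresholds below are hypothetical placeholders; set them to your own requirements:

```python
# Hypothetical acceptance thresholds; tune to your own requirements.
THRESHOLDS = {"overall_f1": 0.90, "per_field_accuracy": 0.85}

def ready_to_deploy(overall_f1: float, per_field_accuracy: dict) -> bool:
    """True when overall F1 and every field's accuracy clear the thresholds."""
    return (overall_f1 >= THRESHOLDS["overall_f1"]
            and all(acc >= THRESHOLDS["per_field_accuracy"]
                    for acc in per_field_accuracy.values()))

print(ready_to_deploy(0.93, {"PatientName": 0.97, "InvoiceTotal": 0.81}))
# → False  (InvoiceTotal is below 0.85)
```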
8. Deploy the processor
- Deploy a model version to get a Processor ID (example: projects/PROJECT/locations/us/processors/PROCESSOR_ID).
9. Integrate with your app
- Call the processor via REST or client libraries (Python, Node, Java, Go).
- Common architecture: upload to Cloud Storage → Cloud Function triggers processing → store structured output.
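Over REST, a deployed processor is called by POSTing a JSON body to `https://{location}-documentai.googleapis.com/v1/{processor_name}:process` with an OAuth 2.0 bearer token (in Python, the `google-cloud-documentai` client library's `process_document` wraps this). A stdlib-only sketch of building the request body, with the sending and auth omitted:

```python
import base64
import json

def build_process_request(raw_bytes: bytes,
                          mime_type: str = "application/pdf") -> str:
    """Build the JSON body for a Document AI v1 `:process` REST call.

    The processor is addressed in the URL, e.g.
    POST https://us-documentai.googleapis.com/v1/projects/PROJECT/locations/us/processors/PROCESSOR_ID:process
    (authentication via a bearer token is omitted here).
    """
    body = {
        "rawDocument": {
            # Inline documents are sent base64-encoded.
            "content": base64.b64encode(raw_bytes).decode("ascii"),
            "mimeType": mime_type,
        }
    }
    return json.dumps(body)

print(build_process_request(b"%PDF-1.7 ..."))
```

The response's `document.entities` carries the extracted fields, each with a type, mention text, and confidence score.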
10. Store results
- Options: BigQuery (analytics), Firestore/Datastore (document records), Cloud SQL (relational), Cloud Storage (raw JSON), Document AI Warehouse (searchable archive).
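For the BigQuery option, extracted entities need flattening into row dicts before insertion (e.g., via `bigquery.Client.insert_rows_json`). A sketch of that transformation; the `(type, text, confidence)` tuples stand in for Document AI's entity objects (`entity.type_`, `entity.mention_text`, `entity.confidence`), and the bucket path is hypothetical:

```python
def entities_to_row(doc_uri: str, entities: list) -> dict:
    """Flatten (type, text, confidence) entity tuples into one flat row dict."""
    row = {"source_uri": doc_uri}
    for etype, text, confidence in entities:
        row[etype] = text
        row[f"{etype}_confidence"] = confidence
    return row

row = entities_to_row(
    "gs://my-bucket/claim_001.pdf",   # hypothetical bucket/object
    [("PatientName", "Jane Doe", 0.98), ("InvoiceTotal", "131.50", 0.91)],
)
print(row)
# Pass a list of such rows to bigquery.Client().insert_rows_json(table, rows).
```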
11. Automate pipelines (optional)
- Typical flow: GCS upload → Pub/Sub / Cloud Function → Document AI process → write to BigQuery / DB → trigger downstream workflows.
- Use Workflows or Cloud Run for advanced orchestration and retries.
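If you orchestrate the pipeline yourself rather than relying on Workflows' built-in retry policies, a generic exponential-backoff wrapper covers transient processing failures. A minimal sketch (the wrapped call in the usage comment is hypothetical):

```python
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Call fn(); on exception, retry with exponential backoff (1s, 2s, 4s...)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Usage (hypothetical):
# result = with_retries(lambda: client.process_document(request=req))
```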
12. Monitor & retrain
- Track performance over time; add new samples as layouts or business rules change.
- Version processors and perform A/B tests if needed.
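When comparing a candidate processor version against the deployed one, a simple per-field regression check makes the A/B decision concrete. The helper and tolerance below are illustrative, not a Document AI API:

```python
def regressions(current: dict, candidate: dict, tolerance: float = 0.01) -> list:
    """Fields where the candidate version's score drops by more than tolerance."""
    return [field for field in current
            if candidate.get(field, 0.0) < current[field] - tolerance]

print(regressions(
    {"PatientName": 0.97, "InvoiceTotal": 0.88},   # deployed version's F1
    {"PatientName": 0.98, "InvoiceTotal": 0.80},   # candidate version's F1
))
# → ['InvoiceTotal']
```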