Overview

This page walks through the steps to create, train, deploy, and integrate a custom processor in Google Cloud Document AI that extracts structured data from your forms.

Step-by-step process

1. Collect sample documents

  • Gather representative examples: 20–200 documents depending on complexity (minimum ~10).
  • Include layout variants, handwritten and typed text, different languages, filled and empty forms, and edge cases.

2. Create a dataset in Document AI

  • Console: Document AI → Datasets → Create Dataset.
  • Choose the dataset type (Custom Extractor) and select a region.

3. Upload documents

  • Methods: drag-and-drop in console, link Cloud Storage bucket, or use the API for bulk ingestion.
  • Verify file types and page ordering (PDF, TIFF, PNG, JPEG).
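Before bulk ingestion it can help to screen local files against the formats above. A minimal sketch — the extension allow-list here is an assumption for illustration; check the official docs for the full set of supported MIME types:

```python
from pathlib import Path

# Extensions matching the formats listed above (assumed allow-list,
# not the authoritative set from Google's documentation).
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}

def partition_uploads(paths):
    """Split candidate files into (uploadable, rejected) by extension."""
    ok, rejected = [], []
    for p in paths:
        target = ok if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS else rejected
        target.append(p)
    return ok, rejected
```

Running this over a staging directory before upload avoids ingestion errors on unsupported files.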

4. Define entities (schema)

  • Decide the fields to extract (e.g., PatientName, PolicyNumber, InvoiceTotal).
  • For each entity, set a name, a description, and a data type (text, date, number, enum, boolean).
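It is worth drafting the schema as data before entering it in the console, so the team can review field names and types. A hypothetical sketch — the actual schema is configured in the Document AI console or API, not in this format:

```python
# Illustrative schema draft (hypothetical format, for team review only).
SCHEMA = [
    {"name": "PatientName",  "description": "Full patient name",   "type": "text"},
    {"name": "PolicyNumber", "description": "Insurance policy ID", "type": "text"},
    {"name": "InvoiceTotal", "description": "Total amount due",    "type": "number"},
]

ALLOWED_TYPES = {"text", "date", "number", "enum", "boolean"}

def validate_schema(schema):
    """Check that every entity has a unique name and a supported type."""
    names = [e["name"] for e in schema]
    assert len(names) == len(set(names)), "duplicate entity names"
    for e in schema:
        assert e["type"] in ALLOWED_TYPES, f"unsupported type: {e['type']}"
    return True
```

Catching duplicate or mistyped fields at this stage is much cheaper than relabeling documents later.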

5. Label documents

  • Use Document AI's labeling tool to highlight text, mark checkboxes, annotate tables.
  • Label consistently; if a field is absent, mark it as not-present.
  • Best practice: label every entity on each page where it occurs.

6. Train the custom extractor

  • Start training from the dataset UI: choose train/test split and begin.
  • Training yields metrics: precision, recall, F1, and per-field accuracy.
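For reference when reading the training report, F1 is the harmonic mean of precision and recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as reported per field after training."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a field with 0.90 precision and 0.80 recall scores about 0.85 F1 — a useful sanity check when deciding which fields need more labels.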

7. Validate & iterate

  • Inspect errors, add more labels for edge cases, fix inconsistent annotations, and retrain.
  • Repeat until per-field accuracy and overall F1 meet your acceptance criteria.

8. Deploy the processor

  • Deploy a model version to get a Processor ID (example: projects/PROJECT/locations/us/processors/PROCESSOR_ID).

9. Integrate with your app

  • Call the processor via REST or client libraries (Python, Node.js, Java, Go).
  • Common architecture: upload to Cloud Storage → Cloud Function triggers processing → store structured output.
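A sketch of a synchronous call with the Python client library (google-cloud-documentai), assuming the processor from step 8 is deployed. `project_id`, `location`, and `processor_id` are placeholders you supply; the entity parser below operates on the Document's JSON form and keeps only each entity's type and mention text:

```python
def process_pdf(project_id, location, processor_id, pdf_bytes):
    """Send a PDF to a deployed custom extractor and return the Document.
    Requires the google-cloud-documentai package and GCP credentials."""
    from google.cloud import documentai
    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(project_id, location, processor_id)
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(
            content=pdf_bytes, mime_type="application/pdf"
        ),
    )
    return client.process_document(request=request).document

def entities_to_dict(document_json):
    """Flatten extracted entities (Document JSON form) into {field: value}."""
    return {
        e["type"]: e.get("mentionText", "")
        for e in document_json.get("entities", [])
    }
```

The flat dict from `entities_to_dict` is a convenient shape to hand to the storage step below.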

10. Store results

  • Options: BigQuery (analytics), Firestore/Datastore (document records), Cloud SQL (relational), Cloud Storage (raw JSON), Document AI Warehouse (searchable archive).
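For the BigQuery option, extracted fields are commonly flattened into one row per document along with provenance metadata. A minimal sketch — the column names here are illustrative, not a required schema:

```python
import datetime

def to_bigquery_row(source_uri, fields):
    """Shape extracted fields into a flat row for a BigQuery table.
    source_uri and processed_at are assumed provenance columns."""
    row = {
        "source_uri": source_uri,
        "processed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    row.update(fields)
    return row
```

Rows in this shape can then be written with the BigQuery client's `insert_rows_json`, or batched to GCS and loaded.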

11. Automate pipelines (optional)

  • Typical flow: GCS upload → Pub/Sub/Cloud Function → DocAI process → Write to BigQuery / DB → Trigger downstream workflows.
  • Use Workflows or Cloud Run for advanced orchestration and retries.

12. Monitor & retrain

  • Track performance over time; add new samples as layouts or business rules change.
  • Version processors and perform A/B tests if needed.