OCR Discovery Questions


✅ OCR on GCP - some things to uncover

1. About the Forms Themselves

Form Inventory & Characteristics

  • Can you provide samples of all six forms (blank + completed)?
  • Are these forms fixed layout, semi-structured, or unstructured?
  • Do the layouts vary by version, region, or revision year?
  • How frequently do the forms change?

Input Quality

  • What formats will documents arrive in? (PDF, images, scans, photos)
  • What is the typical scan resolution (DPI) and image quality?
  • Are the forms typed, handwritten, or a mix?
  • Will images need pre-processing (deskew, noise removal, contrast enhancement)?

2. Data Fields & Extraction Requirements

Field Definitions

  • What specific fields must be extracted from each form?
  • Which fields are mandatory, and which are optional?
  • Are certain fields more critical (higher accuracy requirements)?

Validation Rules

  • Are there constraints on the fields? (e.g., date formats, numeric ranges, lists of allowed values, ID patterns)
  • Do any fields require cross-field validation? (e.g., total = sum of parts)
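
To make the validation questions concrete, here is a minimal sketch of how answers to them might be turned into checks. The field names (date, customer_id, total, line_items), the date format, and the ID pattern are all hypothetical placeholders, not taken from the actual forms.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical ID pattern for illustration: two letters followed by six digits.
ID_PATTERN = re.compile(r"[A-Z]{2}\d{6}")


def validate_fields(fields: dict) -> list[str]:
    """Return a list of validation errors (empty when all checks pass)."""
    errors = []

    # Format constraint: date is assumed to be YYYY-MM-DD.
    try:
        datetime.strptime(fields.get("date", ""), "%Y-%m-%d")
    except ValueError:
        errors.append("date: expected YYYY-MM-DD")

    # Pattern constraint: the ID must match the assumed pattern exactly.
    if not ID_PATTERN.fullmatch(fields.get("customer_id", "")):
        errors.append("customer_id: expected two letters and six digits")

    # Cross-field rule: total must equal the sum of the line items.
    # Decimal avoids floating-point rounding on currency values.
    try:
        total = Decimal(fields.get("total", ""))
        parts = [Decimal(p) for p in fields.get("line_items", [])]
    except InvalidOperation:
        errors.append("total/line_items: not numeric")
    else:
        if total != sum(parts):
            errors.append("total: does not equal sum of line_items")

    return errors
```

Returning a list of errors, rather than failing on the first one, lets a reviewer see every problem on a form in one pass.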

3. Volumes, Throughput, and SLAs

  • How many documents per day/week/month?
  • Are there peak loads?
  • What is the required processing time per document?
  • What is the acceptable failure or exception rate?
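
The volume and processing-time answers feed directly into capacity sizing. As a rough sketch (the function name, the 70% utilization target, and the example numbers are assumptions for illustration only), the number of parallel workers needed at peak can be estimated as arrival rate times processing time:

```python
import math


def required_workers(docs_per_peak_hour: int,
                     seconds_per_doc: float,
                     utilization: float = 0.7) -> int:
    """Estimate how many parallel workers are needed to keep up at peak.

    utilization is the fraction of each worker-hour assumed to be spent
    on useful processing; the rest is headroom for retries and bursts.
    """
    busy_seconds = docs_per_peak_hour * seconds_per_doc
    return math.ceil(busy_seconds / (3600 * utilization))


# Hypothetical example: 1,200 documents in the peak hour at 15 s each.
print(required_workers(1200, 15))
```

With a managed OCR service the same arithmetic is more likely to be checked against request quotas than against worker counts, but the inputs are the same.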

4. Workflow & Integration

Document Ingestion

  • How will documents be uploaded or received? (email, SFTP, cloud bucket, application upload)
  • Do incoming documents need to be classified into one of the six form types?

Downstream Use

  • Where should the extracted data go? (database, API, workflow system, data lake, etc.)
  • Do you need PDFs split, renamed, or indexed?

Exception Handling

  • How should forms with extraction errors be handled?
  • Does a human need to review low-confidence fields?
  • Do you need an interface for manual correction?

5. Accuracy, Confidence, and Tuning

  • What minimum accuracy do you expect per field?
  • Should the OCR system flag values below a confidence threshold?
  • Do you want continuous model improvement based on corrected samples?
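
One way the threshold question could play out in practice is sketched below. It assumes the OCR engine reports a confidence between 0.0 and 1.0 per field; the field names and threshold values are hypothetical and would come from the answers above.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed range 0.0 to 1.0, as reported by the engine


# Hypothetical thresholds: critical fields get a stricter cut-off.
DEFAULT_THRESHOLD = 0.80
CRITICAL_THRESHOLDS = {"total": 0.95}


def fields_needing_review(fields: list[ExtractedField]) -> list[ExtractedField]:
    """Return the fields whose confidence falls below their threshold."""
    return [
        f for f in fields
        if f.confidence < CRITICAL_THRESHOLDS.get(f.name, DEFAULT_THRESHOLD)
    ]
```

Per-field thresholds keep the human review queue small while still holding critical fields to a higher bar.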

6. Security, Privacy & Compliance

  • Do the forms contain sensitive data (PII, PHI, financial data)?
  • What compliance standards apply? (HIPAA, GDPR, SOC 2, internal policies)
  • Who should have access to the extracted data and images?
  • Do documents need to be retained or purged after processing?

7. Technical Environment & Constraints

  • Will this run on-prem, in the cloud, or in a hybrid setup?
  • Are you using a specific cloud provider? (GCP, AWS, Azure)
  • Do you prefer a specific OCR engine? (Google Document AI, Amazon Textract, Azure AI Document Intelligence (formerly Form Recognizer), Tesseract, etc.)
  • Are there existing systems that this solution must integrate with?

8. Reporting, Metrics & Audit

  • What metrics should be monitored? (processing time, accuracy, failure rate, volumes)
  • Do you need audit logs of extracted fields?
  • Should corrections be traceable?
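
If corrections are logged, per-field accuracy can be derived from the audit trail itself. The sketch below assumes a hypothetical record shape with field, extracted, and corrected keys, where corrected is None when the reviewer accepted the extracted value.

```python
from collections import defaultdict


def field_accuracy(audit_records) -> dict[str, float]:
    """Compute the share of extractions per field that needed no correction."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in audit_records:
        totals[rec["field"]] += 1
        # An accepted value, or a correction identical to the extraction,
        # counts as a correct extraction.
        if rec["corrected"] is None or rec["corrected"] == rec["extracted"]:
            correct[rec["field"]] += 1
    return {name: correct[name] / totals[name] for name in totals}
```

The resulting figures can be compared against the minimum accuracy agreed per field, and only reflect documents that actually went through review.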

9. User Experience

  • Who will use the system (ops, analysts, customers)?
  • Do you need a dashboard or UI?
  • Should users be able to upload, review, correct, and approve documents?

10. Project Constraints

  • What is the timeline?
  • What budget constraints exist?
  • How will acceptance testing be defined?