Preview: Technical guide #3

Multimodal input: How you turn handwriting, voice messages, WhatsApp, or public tender PDFs into ERP items

A detailed look at multimodal AI and the technologies that extract data from any unstructured input (photo, audio, PDF) to speed up sales.

LIVE

Guide #1 AI in Sales was launched on 21st Jan 2026. Read it here

The next guide will be published on April 22, 2026.
You will receive the PDF file 48 hours before the official release.

Get the Full Manual

Why unstructured data costs time

$100

Cost per error

Average cost (in USD) a company bears when a data entry error is left uncorrected.

Source: 1-10-100 Rule ➛

99%

OCR accuracy

Character recognition rate for high-quality scanned documents.

Source: Intuition Labs ➛

>88%

Word-level accuracy

Share of individual words transcribed without error from clear audio by Google Speech-to-Text. Other models exceed 94%.

Source: Soniox ➛

<5 sec

Transcription from image

Time needed to transcribe a list of 50 products from a handwritten image.

AI Sales demo

The technical problem: Photos/Audio/PDF vs. text-only ERP

Multimodal processing is critical both for sales and for digitizing accounting and operational documents. Multimodal technology can read invoices, delivery notes, or handwritten orders, can enable voice search, and can automatically insert the extracted data into the ERP. It dramatically reduces the need for manual transcription (data entry), giving operators a software automation solution.

Scenario: The client sends a photo of a broken part on WhatsApp and asks "Do you have something like this?".
ERP reality: The system requires an exact SKU code for search.
Manual procedure: The agent loses 20 minutes searching online, in PDF catalogs, or calling the warehouse.

Over 80% of business data is unstructured. Standard ERP systems cannot ingest it, which is why it has to be transcribed by hand, often with errors.

Join the priority list

The solution: LLM + OCR + Speech-to-text

A multimodal LLM such as Google Gemini can "see" and "hear". It can also quickly process documents hundreds of pages long, such as some public tenders. The automated flow, built on Google Cloud technologies such as Vision and Speech-to-Text, looks like this:

1. Ingestion: Taking in the image, audio file, or request for quotation (PDF, DOCX, etc.) directly from WhatsApp or email.
2. Transcription (OCR/STT): The system extracts raw text from the image or voice.
3. Semantic mapping: The extracted text is matched to products in the database, e.g., through fuzzy matching, and quantities are parsed from their different notations (e.g., "3 pieces", "3 pcs.", "3x").

Case study

Situation: An equipment distributor receives hundreds of orders via WhatsApp (photos of packaging, handwritten lists). Three employees decipher and transcribe them into the ERP.

Solution: The AI Sales software automatically identifies products and quantities, suggests the ERP codes, and the agent only validates.

Business impact: Processing time drops by 90%, allowing employees to focus on proactive sales.

Want to process any input? Join the list

The PDF guide "Multimodal Input" will contain the full technical explanations.

Table of contents:

Google Vision API vs Gemini Pro. When we use each technology.
Handwriting processing. Accuracy examples, including Romanian.
Audio processing. Transcribing phone calls and voice orders, including in Romanian.
Product mapping (Fuzzy Matching). How we go from text in a photo to the ERP code.
Processing large documents. Tender documents, supplier catalogs, etc.
Workflow & UI. What a split-screen interface for the agent looks like.