Multimodal input: How you turn handwriting, voice messages, WhatsApp, or public tender PDFs into ERP items
Detailed information about multimodal AI and data extraction technologies for any unstructured input (photo/audio/PDF), for faster sales.
The next guide will be published on March 25, 2026.
You will receive the PDF file 48 hours before the official release. Get the Full Manual
Why unstructured data costs time
Average cost (in the US) a company bears when a data entry error goes uncorrected.
Character recognition rate for high-quality scanned documents.
Word-level transcription accuracy for clear audio with Google Speech-to-Text. Other models reach >94%.
Time needed to transcribe a list of 50 products from a handwritten image.
The technical problem: Photos/Audio/PDF vs. text-only ERP
Multimodal processing is critical both for sales and for digitizing accounting and operational documents. Multimodal technology can read invoices, delivery notes, or handwritten orders, can enable voice search, and can automatically enter the extracted data into the ERP. It dramatically reduces the need for manual transcription (data entry), giving operators a software automation solution.
ERP reality: The system requires an exact SKU code for search.
Manual procedure: The agent loses 20 minutes searching online, in PDF catalogs, or calling the warehouse.
Over 80% of business data is unstructured. Standard ERP systems ignore it, which is why data has to be transcribed, often with errors.
The solution: LLM + OCR + Speech-to-text
A multimodal LLM such as Google Gemini can "see" and "hear": it accepts images and audio as well as text. It can also quickly process documents hundreds of pages long, such as some public tenders. An automated flow built on Google Cloud technologies such as Vision and Speech-to-Text looks like this:
- 1. Ingestion: The system takes in the image, audio file, or request for quotation (PDF, DOCX, etc.) directly from WhatsApp or email.
- 2. Transcription (OCR/STT): The system extracts raw text from the image or voice recording.
- 3. Semantic mapping: Products in the extracted text are matched to database entries, e.g., through fuzzy matching, and quantities are recognized in their different spellings (e.g., "3 pieces", "3 pcs.", "3x").
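The semantic mapping step can be illustrated with a minimal sketch. It assumes the OCR/STT step has already produced one line of raw text per ordered item. The catalog contents, the SKU codes, the function names, and the 0.6 similarity threshold are all hypothetical, chosen only for illustration, and the standard-library `difflib` module stands in for whatever fuzzy matcher a production system would use.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple

# Hypothetical catalog: ERP code -> product name (illustrative data only).
CATALOG = {
    "SKU-1001": "cordless drill 18V",
    "SKU-1002": "angle grinder 125mm",
    "SKU-1003": "safety gloves size 10",
}

UNIT = r"(?:x|pcs?\.?|pieces?)"
# Quantity written before the product: "3x drill", "3 pcs. drill", "3 pieces drill".
QTY_PREFIX = re.compile(rf"^\s*(\d+)\s*{UNIT}\s+(.+?)\s*$", re.IGNORECASE)
# Quantity written after the product: "drill 3 pcs", "drill 3x".
QTY_SUFFIX = re.compile(rf"^\s*(.+?)\s+(\d+)\s*{UNIT}\s*$", re.IGNORECASE)


def parse_quantity(line: str) -> Tuple[int, str]:
    """Split a raw line into (quantity, product text); quantity defaults to 1."""
    for pattern, qty_group, name_group in ((QTY_PREFIX, 1, 2), (QTY_SUFFIX, 2, 1)):
        m = pattern.match(line)
        if m:
            return int(m.group(qty_group)), m.group(name_group)
    return 1, line.strip()


def match_product(name: str, threshold: float = 0.6) -> Tuple[Optional[str], float]:
    """Return the best-scoring catalog code, or None if below the threshold."""
    best_sku, best_score = None, 0.0
    for sku, product in CATALOG.items():
        score = SequenceMatcher(None, name.lower(), product.lower()).ratio()
        if score > best_score:
            best_sku, best_score = sku, score
    if best_score < threshold:
        return None, best_score
    return best_sku, best_score


def map_order_line(line: str) -> dict:
    """Turn one transcribed line into a suggested ERP item for the agent to validate."""
    quantity, name = parse_quantity(line)
    sku, score = match_product(name)
    return {
        "sku": sku,
        "quantity": quantity,
        "score": round(score, 2),
        "needs_review": sku is None,  # no confident match: leave it to the agent
    }


if __name__ == "__main__":
    for raw in ["3x cordless dril 18v", "3 pcs. angle grinder 125 mm", "qqqq"]:
        print(raw, "->", map_order_line(raw))
```

Lines that fall below the threshold are flagged for review rather than guessed, which keeps the agent in the validating role. A real deployment would likely need language-specific unit words and a matcher tuned to the actual catalog.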
Case study
Situation: An equipment distributor receives hundreds of orders via WhatsApp (photos of packaging, handwritten lists). Three employees decipher and transcribe them into the ERP.
Solution: The AI Sales software automatically identifies products and quantities, suggests the ERP codes, and the agent only validates.
Business impact: Processing time drops by 90%, allowing employees to focus on proactive sales.
Want to process any input? Join the list
The PDF guide "Multimodal Input" will contain the full technical explanations.
- Google Vision API vs Gemini Pro. When we use each technology.
- Handwriting processing. Accuracy examples, including for Romanian.
- Audio processing. Transcribing phone calls and voice orders, including in Romanian.
- Product mapping (Fuzzy Matching). How we go from text in a photo to the ERP code.
- Processing large documents. Tender documents, supplier catalogs, etc.
- Workflow & UI. What a split-screen interface for the agent looks like.
You will receive the PDF guides by email 48h before the official release.
Next guide: Hybrid Search & Assistants (Chatbots)