The report writes itself on the drive back.
Our client's team visits buildings for a living. The visit is the easy part. The evening after the visit is where the job actually hurt: transcribing notes from memory, forcing them into the company's report format, resizing photos, dropping them into the right sections, fixing the layout the photos just broke. We built an agent that takes a voice note and a set of photos straight from the field and returns the finished report, in the exact format the client has always used. The inspector inspects. The agent types.
Hermes runtime · Telegram intake · Speech-to-text · Document pipeline · Human review
Every site visit produced the same debt: a report that had to exist by tomorrow. The findings were fresh in the inspector's head at the site, and stale by the time they sat down at a keyboard three visits later. Writing one report meant reconstructing observations from memory, typing them into a rigid template, and then came the worst part, the photos: transferring them from the phone, picking the right ones, resizing them, placing each under the right section, and watching the document layout collapse anyway.
A single report cost around two hours of desk work for thirty minutes of actual field observation. The team was hired for their eyes and judgment, and they were spending most of their time being typists and layout technicians.
The format itself was not negotiable. Clients of our client receive these reports, and they expect the same structure, the same sections, the same look, every time. Whatever we built had to produce that exact document, not an approximation of it.
> capture
Standing at the site, the inspector sends the agent a voice message describing what was found, in normal spoken language, in whatever order it comes to mind. Photos of the building follow in the same chat.
> transcribe
The voice note is transcribed and cleaned. Spoken shorthand, repetitions, and "wait, back to the roof" corrections are resolved into coherent findings.
> structure
The agent maps the findings onto the client's report template. It knows the form: which sections exist, what belongs where, which fields are mandatory. A remark about the facade lands in the facade section no matter when it was said.
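The ordering behavior can be sketched in a few lines. This is a minimal illustration, not the production code: it assumes each cleaned finding already carries a section label (how that label gets assigned is not shown), and the section names and the `Finding` type are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical section order; the real order comes from the client's template.
TEMPLATE_SECTIONS = ["roof", "facade", "interior"]


@dataclass
class Finding:
    section: str  # label assigned upstream (assumed, not shown here)
    text: str     # cleaned sentence from the transcript


def group_by_section(findings):
    """Group findings in template order, ignoring the order they were spoken in."""
    grouped = {name: [] for name in TEMPLATE_SECTIONS}
    for finding in findings:
        if finding.section not in grouped:
            raise ValueError(f"unknown section: {finding.section}")
        grouped[finding.section].append(finding.text)
    return grouped


# Findings in the order the inspector happened to say them.
spoken = [
    Finding("interior", "Damp patch on the stairwell ceiling."),
    Finding("facade", "Hairline crack above the entrance."),
    Finding("roof", "Two tiles displaced near the chimney."),
    Finding("facade", "Render flaking on the north side."),
]
report = group_by_section(spoken)
```

Both facade remarks end up together under the facade section, even though a roof remark was spoken between them.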
> illustrate
Photos are analyzed, matched to the findings they document, resized, and placed under the correct sections with captions. The layout survives, every time.
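Much of "the layout survives" comes down to never letting a photo exceed the space its section allows. The arithmetic can be sketched as below; the function name and the box dimensions are illustrative assumptions, and the actual placement in the document pipeline is not shown.

```python
def fit_within(width_px, height_px, max_width_px, max_height_px):
    """Scale a photo down to fit a box, keeping its aspect ratio.

    Photos that already fit are returned unchanged (no upscaling).
    """
    if width_px <= 0 or height_px <= 0:
        raise ValueError("photo dimensions must be positive")
    scale = min(max_width_px / width_px, max_height_px / height_px, 1.0)
    # Guard against rounding a very thin photo down to zero pixels.
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


# Hypothetical box of 1200 x 1200 pixels per photo slot.
landscape = fit_within(4000, 3000, 1200, 1200)
portrait = fit_within(3000, 4000, 1200, 1200)
small = fit_within(800, 600, 1200, 1200)
```

Landscape and portrait shots are both bounded by the same box, so a tall phone photo cannot push the next heading onto a new page.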
> ask
If a mandatory section has no finding and no photo, the agent asks the inspector directly in the chat, while they are still near the site and can still check.
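The follow-up rule is simple enough to state as code. A minimal sketch, assuming findings and photos are already grouped by section; the section names and the question wording are hypothetical.

```python
# Hypothetical list; the real one is defined by the client's template.
MANDATORY_SECTIONS = ["roof", "facade"]


def follow_up_questions(findings_by_section, photos_by_section):
    """Return one question per mandatory section with no finding and no photo."""
    questions = []
    for section in MANDATORY_SECTIONS:
        has_finding = bool(findings_by_section.get(section))
        has_photo = bool(photos_by_section.get(section))
        if not has_finding and not has_photo:
            questions.append(
                f"Nothing recorded for '{section}' yet. "
                "Can you add a note or a photo while you are on site?"
            )
    return questions


questions = follow_up_questions(
    findings_by_section={"roof": ["Two tiles displaced near the chimney."]},
    photos_by_section={"roof": ["IMG_0001.jpg"]},
)
```

Here the roof is covered and the facade is not, so exactly one question goes back into the chat.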
> deliver
The finished document comes back to the inspector for review. A human reads it, corrects what needs correcting, and signs off. The agent drafts the report; the inspector owns it.
Layer I · Visual Architecture
One diagram: voice and photos in, then transcript, template mapping, document assembly, review, and archive. The review step sits inside the diagram, not after it. The system was designed around the inspector's sign-off, not despite it.
Layer II · Contracts
The report template is the contract. Every section, mandatory field, photo placement rule, and caption format was extracted from the client's existing reports and written down before any code. The agent does not have an opinion about the format; it has an obligation to it.
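One way to picture such a contract is as plain, immutable data that the rest of the pipeline reads but never modifies. The sketch below is an assumption about shape only: the class names, fields, section names, and caption pattern are all hypothetical and do not reproduce the client's actual template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionRule:
    name: str
    mandatory: bool
    max_photos: int       # photo placement rule (hypothetical)
    caption_format: str   # e.g. "Photo {n}: {text}"

    def caption(self, n, text):
        return self.caption_format.format(n=n, text=text)


@dataclass(frozen=True)
class TemplateContract:
    version: str
    sections: tuple  # tuple of SectionRule, in report order

    def rule(self, name):
        for section in self.sections:
            if section.name == name:
                return section
        raise KeyError(name)

    def mandatory_names(self):
        return [s.name for s in self.sections if s.mandatory]


# Illustrative values only.
contract = TemplateContract(
    version="v1",
    sections=(
        SectionRule("roof", True, 4, "Photo {n}: {text}"),
        SectionRule("facade", True, 6, "Photo {n}: {text}"),
        SectionRule("remarks", False, 0, "Photo {n}: {text}"),
    ),
)
caption = contract.rule("facade").caption(2, "Crack above entrance")
```

Because the dataclasses are frozen, code downstream can look rules up but cannot quietly change them, which is the point of calling it a contract.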
Layer III · Technical Diagrams
Transcription flow, findings-to-section mapping, photo analysis and matching, the follow-up question logic, and document assembly, specified in advance. The photo-matching step got the most design attention, because a photo under the wrong heading is the kind of error that destroys trust in the whole document.
Layer IV · Implementation
Hermes runtime on a dedicated isolated instance. Telegram as the field intake channel, because it is already on every inspector's phone. Speech-to-text for transcription, a python-docx document pipeline for assembly, Supabase for report history and the photo archive.
runtime: Hermes
intake: Telegram (voice notes + photos from the field)
speech: speech-to-text with cleanup pass
documents: python-docx assembly, fixed client template
state: Supabase (reports, transcripts, photo archive)
oversight: inspector review and sign-off before delivery
Bad audio
Wind, traffic, machinery. Low-confidence transcript segments are flagged in the draft instead of silently guessed, so the inspector sees exactly which sentence to double-check.
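Flagging instead of guessing can be sketched as follows. It assumes the transcription step returns segments with a confidence score between 0 and 1; the threshold value and the marker text are hypothetical choices for illustration.

```python
# Hypothetical cut-off; a real value would be tuned on field recordings.
CONFIDENCE_THRESHOLD = 0.80


def render_transcript(segments):
    """Join (text, confidence) segments, marking the uncertain ones for review."""
    parts = []
    for text, confidence in segments:
        if confidence < CONFIDENCE_THRESHOLD:
            parts.append(f"[CHECK AUDIO: {text}]")
        else:
            parts.append(text)
    return " ".join(parts)


draft = render_transcript([
    ("Two tiles displaced near the chimney.", 0.95),
    ("Gutter bracket loose on the east side.", 0.55),
])
```

The second sentence arrives in the draft wrapped in a visible marker, so the inspector knows which one to verify rather than having to reread everything.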
Wrong photo, wrong section
Photo matching is conservative. When the agent cannot confidently tie a photo to a finding, it asks instead of placing it. A misplaced photo is worse than a question.
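"Conservative" can be made concrete with two conditions: the best candidate must score high enough, and it must clearly beat the runner-up. This is a sketch under assumptions: the scores are presumed to come from an upstream photo analysis step, and the function name and both thresholds are hypothetical.

```python
def match_photo(scores, min_score=0.75, min_margin=0.15):
    """Return the section to place a photo under, or None to ask the inspector.

    `scores` maps section names to match scores between 0 and 1.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_name, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score >= min_score and best_score - runner_up >= min_margin:
        return best_name
    return None  # not confident enough: ask instead of placing


clear = match_photo({"roof": 0.90, "facade": 0.30})
ambiguous = match_photo({"roof": 0.80, "facade": 0.75})
weak = match_photo({"roof": 0.50})
```

Only the first photo is placed automatically. The other two return `None`, which is the signal to send a question rather than risk a photo under the wrong heading.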
Missing mandatory fields
The template contract defines what a complete report is. The agent chases missing pieces while the inspector is still on site, not the evening after, when checking means driving back.
Spoken chaos
Field speech is not dictation. Findings arrive out of order, interrupted, and corrected mid-sentence. The structuring step is built for that reality. The report reads as if it was written calmly at a desk because, for the first time, nobody had to.
Format drift
The template lives in one place. When the client updates their format, the contract is updated once and every future report follows: no retraining of habits, no old versions circulating.
~2 hrs
what one report used to cost at a desk
< 10 min
review and sign-off per report now
5 min
average voice note length per site
100%
reports in the exact client format since launch
same day
delivery standard, previously next day or later
Figures from the first three months in production, compared against the client's own time tracking.
The template is knowledge, not formatting. What looked like a layout problem was actually institutional knowledge trapped in a document format. Writing the template down as a contract forced the client to make implicit rules explicit, and the agent now enforces standards that previously depended on who wrote the report and how tired they were.
Ask while they are still there. The single most valuable behavior is the follow-up question sent while the inspector is still on site. A missing photo costs ten seconds to fix in the field and a full return trip the next day. The agent's timing is worth more than its writing.
Voice is the interface field workers actually want. Nobody on a construction site wants an app with forms. A voice note to a chat they already use is zero new habit, and adoption was immediate because the tool met the team where they already were.
Your team's evenings are going into reports.
> ../book_a_call.sh