Field Report Agent

The report writes itself on the drive back.

Our client's team visits buildings for a living. The visit is the easy part. The evening after the visit is where the job actually hurt: transcribing notes from memory, forcing them into the company's report format, resizing photos, dropping them into the right sections, fixing the layout the photos just broke. We built an agent that takes a voice note and a set of photos straight from the field and returns the finished report, in the exact format the client has always used. The inspector inspects. The agent types.

Hermes runtime Telegram intake Speech-to-text Document pipeline Human review

Every site visit produced the same debt: a report that had to exist by tomorrow. The findings were fresh in the inspector's head at the site, and stale by the time they sat down at a keyboard three visits later. Writing one report meant reconstructing observations from memory, typing them into a rigid template, and then the worst part, the photos: transferring them from the phone, picking the right ones, resizing them, placing each under the right section, and watching the document layout collapse anyway.

A single report cost around two hours of desk work for thirty minutes of actual field observation. The team was hired for their eyes and judgment, and they were spending most of their time being typists and layout technicians.

The format itself was not negotiable. Clients of our client receive these reports, and they expect the same structure, the same sections, the same look, every time. Whatever we built had to produce that exact document, not an approximation of it.

> capture

Standing at the site, the inspector sends the agent a voice message describing what was found, in normal spoken language, in whatever order it comes to mind. Photos of the building follow in the same chat.

> transcribe

The voice note is transcribed and cleaned. Spoken shorthand, repetitions, and "wait, back to the roof" corrections are resolved into coherent findings.

> structure

The agent maps the findings onto the client's report template. It knows the form: which sections exist, what belongs where, which fields are mandatory. A remark about the facade lands in the facade section no matter when it was said.

> illustrate

Photos are analyzed, matched to the findings they document, resized, and placed under the correct sections with captions. The layout survives, every time.

> ask

If a mandatory section has no finding and no photo, the agent asks the inspector directly in the chat, while they are still near the site and can still check.

> deliver

The finished document comes back to the inspector for review. A human reads it, corrects what needs correcting, and signs off. The agent drafts the report, the inspector owns it.

Does

  • Turns one voice note and a photo set into a complete, formatted report
  • Enforces the client's exact template, sections, and layout, every time
  • Matches photos to findings and captions them automatically
  • Asks follow-up questions when a mandatory section is empty
  • Archives every report, transcript, and photo set for later retrieval

Does not

  • Invent findings, if it was not said or photographed, it is not in the report
  • Soften or editorialize observations, the inspector's assessment is quoted, not rewritten
  • Send anything to the end customer, review and sign-off are human, always
  • Discard raw material, the original voice note and photos are stored next to the report

Layer I · Visual Architecture

One diagram: voice and photos in, transcript, template mapping, document assembly, review, archive. The review step sits inside the diagram, not after it. The system was designed around the inspector's sign-off, not despite it.

Layer II · Contracts

The report template is the contract. Every section, mandatory field, photo placement rule, and caption format was extracted from the client's existing reports and written down before any code. The agent does not have an opinion about the format, it has an obligation to it.

Layer III · Technical Diagrams

Transcription flow, findings-to-section mapping, photo analysis and matching, the follow-up question logic, and document assembly, specified in advance. The photo-matching step got the most design attention, because a photo under the wrong heading is the kind of error that destroys trust in the whole document.

Layer IV · Implementation

Hermes runtime on a dedicated isolated instance. Telegram as the field intake channel, because it is already on every inspector's phone. Speech-to-text for transcription, a python-docx document pipeline for assembly, Supabase for report history and the photo archive.

runtime        Hermes
intake         Telegram (voice notes + photos from the field)
speech         speech-to-text with cleanup pass
documents      python-docx assembly, fixed client template
state          Supabase (reports, transcripts, photo archive)
oversight      inspector review and sign-off before delivery

Bad audio

Wind, traffic, machinery. Low-confidence transcript segments are flagged in the draft instead of silently guessed, so the inspector sees exactly which sentence to double-check.

Wrong photo, wrong section

Photo matching is conservative. When the agent cannot confidently tie a photo to a finding, it asks instead of placing it. A misplaced photo is worse than a question.

Missing mandatory fields

The template contract defines what a complete report is. The agent chases missing pieces while the inspector is still on site, not the evening after, when checking means driving back.

Spoken chaos

Field speech is not dictation. Findings arrive out of order, interrupted, and corrected mid-sentence. The structuring step is built for that reality, the report reads as if it was written calmly at a desk, because for the first time, nobody had to.

Format drift

The template lives in one place. When the client updates their format, the contract is updated once and every future report follows, no retraining of habits, no old versions circulating.

~2 hrs

what one report used to cost at a desk

< 10 min

review and sign-off per report now

5 min

average voice note length per site

100%

reports in the exact client format since launch

same day

delivery standard, previously next day or later

Figures from the first three months in production, compared against the client's own time tracking.

The template is knowledge, not formatting. What looked like a layout problem was actually institutional knowledge trapped in a document format. Writing the template down as a contract forced the client to make implicit rules explicit, and the agent now enforces standards that previously depended on who wrote the report and how tired they were.

Ask while they are still there. The single most valuable behavior is the follow-up question sent while the inspector is still on site. A missing photo costs ten seconds to fix in the field and a full return trip the next day. The agent's timing is worth more than its writing.

Voice is the interface field workers actually want. Nobody on a construction site wants an app with forms. A voice note to a chat they already use is zero new habit, and adoption was immediate because the tool met the team where they already were.

Your team's evenings are going into reports.

> ../book_a_call.sh