Building an AI Platform
From classical ML to GenAI to a company-wide AI platform — the strategy, architecture, and organizational work behind Porch Group's AI infrastructure.
The Real Story
The most important thing about this work isn't any individual project — it's the orchestration of a company-wide shift in how software is built in the age of AI.
My role in this is evangelist, experimenter, and change agent. I'm learning alongside the organization — actively experimenting with AI tools myself, pushing others to learn, and creating the structure that turns individual experiments into organizational capability. This isn't a top-down mandate. It's leading from the front.
AI is not magic. The gap between a compelling demo and production-grade, fault-tolerant, scalable software is wide. That gap is real in both directions: brownfield — weaving AI into existing systems not designed for it — and greenfield — new AI-native solutions that provide fast feedback loops on ideas before committing to production hardening.
What I've been building is structure for the change: a clear strategy for what to attack first, what value each initiative delivers, and how individual wins compound into durable platform capabilities.
The Three-Pillar Framework
About six months ago, I formalized our approach into an organizing framework:
- AI Features — AI-powered capabilities delivered to customers and internal users
- AI Platform — Shared infrastructure, tools, and governance that enable any team to build AI features without reinventing the stack
- AI-Powered Engineering — Using AI to accelerate the software development lifecycle itself
This framework is the strategic container for everything below. It keeps the work legible to leadership, helps teams understand where their projects fit, and prevents the fragmentation that happens when every team builds their own AI stack independently.
The Journey: Classical ML → GenAI → Platform
Inherited Foundation (2023)
When I joined Porch Group, I took direct management of the ML team — machine learning engineers doing NLP-based text extraction from inspection report PDFs. They were mid-migration from self-managed infrastructure to Union Cloud (managed Flyte), which would give them scalable, observable workflow orchestration without maintaining the underlying infrastructure.
My early role was roadmapping, funding, and evangelism. I gave the team room to engineer while learning the domain — and I began translating what the ML team was building into business language that stakeholders could support.
What the classical ML work delivered:
- Migrated all ML training pipelines to Union Cloud — simplified workflows, enabled cross-training, eliminated infrastructure maintenance burden
- Replaced manual regex-based property attribute extraction (39% precision) with an AI-driven classification model — 85% precision across 7+ critical property attributes, 10x+ cheaper than comparable external LLM solutions
- Built and iterated on computer vision models for property inspection — progressive improvement in model accuracy for critical safety feature detection
- Delivered Porch's first image-only property attribute model — opening an entirely new category of AI capability for attributes that can't be reliably found in inspection text
- Integrated Label Studio with LLM/GenAI workflows to accelerate training data labeling — 2x+ annotation throughput, cycle time from ~1 week to ~1 day
The GenAI Push
As LLMs matured, Porch made a company-wide push into generative AI. I was put in charge of scaling the AI effort across the organization. The first major initiative: a large-scale document intelligence platform — an LLM-powered system to extract structured data from 10M+ inspection report PDFs that had previously been processed with brittle, hard-to-maintain regex pipelines.
This project was a significant bet. It disrupted the existing roadmap. I secured leadership support by clearly articulating what the technology was, what value it would unlock, and what it would take to deliver reliably at scale.
What it delivered:
- LLM-based structured extraction replacing legacy regex — more accurate, maintainable, and extensible
- 88% correct labeling vs. 49% for the regex approach it replaced
- Batch processing at scale: 10M+ documents in ~2 weeks vs. months previously
- ~50% cost savings using batch LLM APIs vs. real-time inference
Prediction Hub: Build vs. Buy
The document intelligence project created a new problem: how do you call an LLM millions of times reliably, with observability, cost control, and proper data routing?
We evaluated off-the-shelf solutions, including Vertex AI, and rejected them over vendor lock-in risk and insufficient observability. We built an internal batch inference orchestration service instead.
Key technical decisions:
- Re-architected infrastructure to enable GPU support
- Fan-out (parallel scaling) architecture for high-volume batch workloads
- Migrated data from Postgres to Firestore — enabling the higher I/O throughput required at scale (~3B records migrated)
- Built retry logic, error-class handling, and model/prompt version abstraction into the platform from the start
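The retry and fan-out decisions above can be sketched roughly as follows. This is an illustrative model, not the actual Prediction Hub implementation: the error classes, worker pool, and stubbed model call are all assumptions.

```python
import concurrent.futures
import time

# Hypothetical error classes: transient failures are retried, permanent ones are not.
class TransientError(Exception): ...
class PermanentError(Exception): ...

def call_model(doc_id: str) -> dict:
    """Stand-in for an LLM call; a real implementation would hit the model API."""
    return {"doc_id": doc_id, "status": "ok"}

def infer_with_retry(doc_id, call=call_model, max_attempts=3, backoff=0.1):
    """Retry transient failures with exponential backoff; fail fast on permanent ones."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call(doc_id)
        except PermanentError:
            return {"doc_id": doc_id, "status": "failed_permanent"}
        except TransientError:
            if attempt == max_attempts:
                return {"doc_id": doc_id, "status": "failed_transient"}
            time.sleep(backoff * 2 ** (attempt - 1))

def fan_out(doc_ids, call=call_model, workers=8):
    """Fan a batch out across a worker pool; failures are isolated per document."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: infer_with_retry(d, call), doc_ids))
```

The design point: one misbehaving document produces a classified failure record instead of poisoning the batch, which is what makes million-call runs operable.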
The Platform Multiplier Effect
Once Prediction Hub was available, a direct report recognized an opportunity: apply the batch architecture to the computer vision inference pipeline, where CV inference calls had previously been made one-by-one.
Result: ~70% cost reduction. Latency from 2.5s to 40ms. Deployment time standardized to under 30 minutes.
This outcome wasn't planned as a Prediction Hub feature — it was a team member seeing an opportunity and using shared platform infrastructure. That's the platform multiplier: build something well once, and the organization finds uses for it you didn't anticipate.
AI Gateway: The Highest-Leverage Decision
As more teams began experimenting with AI independently, I recognized the risk: fragmented tooling, no shared observability, duplicated effort, inconsistent governance.
The architectural insight: all LLM calls across the company should route through a single integration point. We deployed Portkey as the AI Gateway — the highest-leverage decision in the entire platform strategy.
What the gateway provides:
- Unified observability across all LLM usage
- Cost attribution by team and use case
- Vendor-agnostic routing — swap models without changing application code
- Access to evaluation tools (Opik) and batch orchestration (Prediction Hub) downstream
The value proposition to other engineering teams: "Use this gateway, and you get observability, evaluation, and cost tracking for free — just build your feature."
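From an application team's perspective, adoption is mostly a matter of pointing at one URL and tagging requests for attribution. A rough sketch of what a call site looks like behind a gateway; the endpoint, header name, and team tag here are illustrative assumptions, not Portkey's actual API:

```python
import json

# Placeholder endpoint; a real deployment would use the internal gateway URL.
GATEWAY_URL = "https://ai-gateway.internal.example/v1/chat/completions"

def build_gateway_request(prompt: str, model: str, team: str) -> dict:
    """Assemble an OpenAI-style chat request routed through the shared gateway.

    Application code never names a vendor endpoint, so swapping `model`
    (even across providers) is configuration, not a code change. The team
    header gives the gateway what it needs for cost attribution.
    """
    return {
        "url": GATEWAY_URL,
        "headers": {
            "Content-Type": "application/json",
            "x-gateway-team": team,  # illustrative attribution header
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }
```

Because the vendor choice lives in the `model` parameter rather than the integration code, observability and cost tracking stay intact when a team swaps models.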
Adoption is my responsibility. I'm the primary driver of cross-BU alignment — building relationships, explaining the platform, and securing buy-in one team at a time.
Observability: Dual-Track Monitoring
The right tool for the right signal:
| Layer | Tools | What It Covers |
| --- | --- | --- |
| Infrastructure / SLOs | Datadog, Kubernetes probes | Latency, error rates, batch queue SLAs |
| LLM / model quality | Portkey + Opik | Traces, cost, prompt A/B results, user feedback, quality alerts |
| Audit / lineage | Pub/Sub → BigQuery | A/B test analyses, drift detection, training data expansion |
| Alerts | Teams, PagerDuty | Service monitors, anomaly alerts, COE escalation |
What this looks like in practice: A July 2025 anomaly in prediction output rates — initially suspected to be model drift — was investigated and traced to a code regression. Fix deployed same day. 47,420 affected requests reprocessed. The post-mortem produced new anomaly monitors and initiated design of dedicated ML drift monitors to distinguish pipeline breakages from true concept drift.
The discipline: not just dashboards, but monitors that codify learning from incidents.
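As an illustration of turning an incident into a monitor, a rate-drop check like the one added after the July incident might look like this; the threshold and window are hypothetical, not the production values:

```python
from statistics import mean

def output_rate_alert(trailing_rates: list[float], current_rate: float,
                      min_ratio: float = 0.5) -> bool:
    """Alert when the prediction output rate drops far below the trailing baseline.

    A sharp drop usually signals a pipeline breakage (code regression,
    upstream outage) rather than concept drift, which tends to move
    quality metrics gradually — hence separate monitors for each mode.
    """
    baseline = mean(trailing_rates)
    return current_rate < min_ratio * baseline
```

The point isn't the arithmetic; it's that the post-mortem's diagnosis (breakage vs. drift look different in telemetry) is now encoded as an automated check rather than tribal knowledge.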
Governance as Product
Governance at the platform level, not the policy level:
- AI Gateway: Standardizes authentication, cost attribution, and observability for all LLM calls
- Opik: Dataset versioning, prompt A/B testing, quality alerting, feedback loops
- Batch inference service: Model/prompt version abstraction, retry/error isolation, audit trails
- Self-hosted observability: Prompts and responses remain within company infrastructure — reduces privacy risk, keeps compliance and cost telemetry in the same trust boundary
For regulated domains: rollout gating (leadership-only pilots before broader exposure), regulatory test plans, auditability requirements, and legal disclaimers in UI from day one. The philosophy: governance as product behavior, not after-the-fact policy.
AI-Powered Engineering
Beyond product AI, I've driven adoption of AI tools across the software development lifecycle:
- CodeRabbit: Initiated, piloted, and fully adopted — every merge request across Porch Corp Engineering is now AI-reviewed
- AI-powered IDEs: JetBrains AI, Windsurf rolled out to all engineering teams across regions
- Claude Code: Introduced for agentic development workflows
- Impact: ~10 hours/week saved per active user; ~300K lines of code contributed via AI-assisted development (H2 2024)
The change management dimension matters: AI-powered engineering requires a mentality shift. Teaching engineers how to collaborate with AI tools, when to trust outputs and when to verify, how to design for AI-assisted workflows — that's ongoing organizational work.
What's Next: Agentic AI
Three tracks, all early-stage, all running in parallel:
- Workflow automation (n8n): Self-hosted no-code platform for PM/Ops/CX users to build internal automations. LLM calls governed through the AI Gateway. "No-code under guardrails."
- Agentic workflows (Union Cloud V2): Production runtime for both classical ML and agentic flows. We facilitated a cross-BU collaboration where another business unit ran LangGraph-based workflows on our infrastructure — that cross-pollination is informing our own direction. LangGraph is under active evaluation but hasn't been adopted by our team yet.
- Autonomous coding agents: Narrowly scoped, supervised agents targeting low-risk, repetitive tasks — dependency upgrades, security patches, platform migrations. Active work: building a library of high-quality Claude skills and solving for reliable, observable hosted execution.
The philosophy across all three: supervised autonomy where blast radius is known — human gatekeepers, acceptance/rework metrics, AI review in the loop. Speed gains must not erode safety.
Key Metrics
| Metric | Value |
| --- | --- |
| Document scale | 10M+ PDFs processed via batch LLM extraction |
| Batch processing time | Months → ~2 weeks |
| ML classification accuracy | 85% precision (vs. 39% legacy regex) |
| Batch cost savings | ~50% vs. real-time inference |
| CV inference cost reduction | ~70% (platform multiplier) |
| CV inference latency | 2.5s → 40ms |
| AI dev savings | ~10 hrs/week per active user |
| Data migration | ~3B records (Postgres → Firestore) |
| Model cost vs. external LLM | 10x+ cheaper |