July 25, 2025 Steve Hackbarth
A key part of Andrej Karpathy’s recent talk at Y Combinator’s AI Startup School involved the challenges and opportunities of building software in the context of powerful but inherently flawed large language models. Karpathy’s advice — that AI works best when its output is limited in scope and rapidly reviewed by humans — tracks with our experience so far in integrating AI tooling into our workflow.
For all the talk of agentic AI and the replacement of human labor, we’ve found that AI is much more practical as an empowerment tool for our engineers. In Karpathy’s formulation: it’s more of the Iron Man suit that’s responsive to Tony Stark’s commands, and less of the Iron Man suit that flies around by itself saving the heroes.
At Ursa Health we straddle the line between software product company and tech-enabled consultancy, and our customers expect us to deliver data and analytics solutions to answer the key questions facing their business. The source data is invariably grisly, the transformations are difficult, and maintaining a high standard of trusted output is non-negotiable.
It's a fraught environment to introduce AI tooling into, and we’ve taken care to develop a system to harness and control the power that comes with these LLMs. We call it the “eager intern” model.
The Challenges of Healthcare Data Integration
In our work, the data journey typically starts with raw source data, such as a claims data package that has been extracted from a health plan’s transactional system. Our first step is to interrogate this source data to understand its idiosyncrasies: what the fields are, what they mean, and what quirks exist in the data.
To take one example: a data dictionary might be able to tell us if the claim ID column is named clm_id, claimnum, or something else, but it won’t tell us whether this field really meets our standard of being a claim ID. In our definition of the term, a claim ID needs to uniquely identify a claim. So, if there are several rows that are for the same claim, they need to have the same claim ID; if they are not for the same claim, they need to have a different claim ID.
It sounds basic, but when health plans send us data packages, the field that purports to be the claim ID is frequently not a claim ID at all — it might be more of a claim transaction ID, or something else entirely.
Determining whether the putative claim ID can function as a claim ID per our definition — and if it can’t, constructing a claim ID that can — is one of hundreds of tasks that our data engineers need to complete to conform a novel data package to our core data model. Some of these tasks can be informed by eyeballing a screenful of data with the appropriate sort applied; some require aggregate queries to canvass the full scope of each data table.
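To make the idea concrete, here is a minimal sketch of an aggregate check of this kind. The table layout and column names (clm_id, member_id, service dates) are illustrative assumptions, not Ursa Health’s actual schema: if two rows share a claim ID but disagree on a claim-level attribute, the field fails our definition of a claim ID.

```python
import sqlite3

# Hypothetical claims table; column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claims (
    clm_id        TEXT,
    member_id     TEXT,
    service_start TEXT,
    service_end   TEXT
);
INSERT INTO claims VALUES
    ('C001', 'M1', '2025-01-05', '2025-01-05'),
    ('C001', 'M1', '2025-01-05', '2025-01-05'),  -- same claim, same ID: fine
    ('C002', 'M2', '2025-02-10', '2025-02-11'),
    ('C002', 'M3', '2025-03-01', '2025-03-01');  -- same ID, different member: suspect
""")

# A putative claim ID is suspect if rows sharing it disagree on a
# claim-level field (here, member_id is used as the example attribute).
suspect = conn.execute("""
    SELECT clm_id, COUNT(DISTINCT member_id) AS n_members
    FROM claims
    GROUP BY clm_id
    HAVING n_members > 1
""").fetchall()
print(suspect)  # [('C002', 2)]
```

In practice a check like this would canvass several claim-level attributes, but the shape of the query — group by the candidate ID, flag disagreements — stays the same.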
The Verification Loop in Healthcare Data
From this context, it’s easy to see the promise of AI-based tooling to assist our engineers in their work. The data is messy in countless ways, and checking around every corner for gremlins is thankless work. Meanwhile, LLMs are great at writing SQL, and can churn through thankless tasks very quickly.
That said, Karpathy's keynote emphasized something we've believed from the beginning: AI systems need robust human review and validation. This isn't just a nice-to-have in healthcare data — it's an imperative. When you're dealing with patient data that drives clinical decisions and financial outcomes, trust isn't optional.
The key to implementation is to leverage the capabilities of the AI, while keeping the human firmly in the driver’s seat.
The Eager Intern at Work
What this looks like in practice is that we’ve decomposed our work into the smallest-scoped pieces possible and set the LLM to work on these tasks. Like an eager intern, the AI is enthusiastic and fast. It writes SQL queries in seconds, synthesizes information, and responds with confidence.
But also like an intern, it needs supervision. We tell the LLM that its job is to convince a skeptical engineer that it has answered the question correctly, so as to keep it diligent about marshalling its evidence. The eager intern is only entitled to make assertions; it’s still the role of the engineer to pass judgment on each assertion based on the quality of the presented evidence.
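One lightweight way to structure this relationship can be sketched as follows. The shape and field names here are our illustration, not a description of a specific product feature: every AI answer arrives as an assertion paired with reproducible evidence, and only the engineer’s explicit judgment changes its status.

```python
from dataclasses import dataclass

@dataclass
class Assertion:
    """One small-scoped claim from the AI, plus the evidence behind it."""
    claim: str                # e.g. "clm_id uniquely identifies a claim"
    evidence_query: str       # SQL the engineer can rerun to verify
    evidence_result: str      # what the query returned
    verdict: str = "pending"  # engineer sets "accepted" or "rejected"

def review(a: Assertion, accepted: bool) -> Assertion:
    # The human stays in the driver's seat: only the engineer's
    # judgment moves an assertion out of "pending".
    a.verdict = "accepted" if accepted else "rejected"
    return a

a = Assertion(
    claim="clm_id is not a true claim ID",
    evidence_query="SELECT clm_id FROM claims GROUP BY clm_id "
                   "HAVING COUNT(DISTINCT member_id) > 1",
    evidence_result="1 clm_id maps to 2 distinct members",
)
review(a, accepted=True)
print(a.verdict)  # accepted
```

The point of the structure is that the evidence is reproducible: the engineer can rerun the query rather than take the intern’s word for it.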
It’s a funhouse mirror version of how senior engineers have long worked with inexperienced team members. You give them a specific task, they show you their approach and findings, and you verify their work before moving forward. The difference is that this intern works at superhuman speed, never gets tired of the tedious stuff, and is always at risk of outlandish fabrication. This makes it both possible and essential that the feedback loops be kept as small as possible.
The Trust Dividend
Healthcare is an industry awash in data and reporting requirements, which are used to generate quality measures and other analytics that nearly everyone ignores. This might give the impression that the industry is not particularly amenable to using data, or is already saturated with this kind of information. In fact, the opposite is true: what we’ve found are organizations, teams, and individuals practically starving for meaningful feedback.
Mistakes in the interpretation and transformation of data across systems are easy to make, can be difficult to detect, and often have catastrophic downstream consequences for the integrity of the analysis as a whole. No amount of excellent analytics or visualizations can solve the problem of garbage-in-garbage-out.
Trust is our most important asset, so the accuracy of our work has always been paramount. From this perspective, it’s no wonder that we’ve been hesitant to hop on the AI bandwagon. Confidently-stated nonsense in healthcare data analytics is the blight that we started Ursa Health to combat; it’s also a hallmark of the current generations of large language models.
What we’ve found, however, is that harnessing AI through the eager intern model unlocks a level of rigor that had previously been difficult to maintain. This isn't about replacing judgment — it's about accelerating the path to good judgment. In the next piece, I’ll describe how we build a venerable healthcare institution, the checklist, into our tooling to organize the partnership between the engineer and the eager AI intern.
About the Author
Steve Hackbarth, Chief Technology Officer
Before joining Ursa Health, Steve was head of development at xTuple, the world's #1 open-source ERP software company, managing a diverse development team that spanned four continents. Before that, he founded Speak Logistics, a tech startup born out of his experience in the transportation industry, which introduced the user-friendly and modern sensibilities of the consumer Internet into the enterprise space. His professional passions include JavaScript, open-source software, continuous integration, and scalable code.
Steve holds a bachelor's degree in computer science from Harvard University and an MBA from William and Mary. He is a frequent speaker on subjects such as JavaScript, modular architecture, git, open source, and asynchronous programming.