Build Log: Ohmyword — Who Owns the Data?
I'm building a Serbian language learning app called Ohmyword. Early on, I made an architecture decision that felt almost too simple: two tables. One is the source of truth — the dictionary entry with all its linguistic metadata. The other is a search cache — every inflected form a user might type, pointing back to the source.
It took me about a week to realize this wasn't a database decision. It was an ownership decision. And it's the same decision that determines whether enterprise AI deployments succeed or quietly die.
The Problem a Flashcard App Taught Me
Serbian has 7 grammatical cases. A single noun like "pas" (dog) can take up to 14 different forms depending on its role in a sentence. "Psa," "psu," "psom" — these are all the same word. A learner who encounters "psa" in a text needs to be able to search for it and find their way back to "pas."
My first instinct was to store everything in one table. One row per form, with metadata attached. Simple.
But then the questions started:
If I update the English translation of "pas," do I need to update all 14 forms? What happens when I realize a declension is wrong — is the fix in one place or scattered across 14 rows? When I eventually build a rule engine to auto-generate forms, what does it overwrite and what does it leave alone?
The answer was to separate what's authoritative from what's derived. The vocabulary_words table is the source of truth — one row per word, with gender, animacy, verb aspect, all the metadata that defines how that word behaves. The search_terms table is generated from it. It can be rebuilt, corrected, expanded. It even has a source field that tracks whether a form was manually entered, seeded, or engine-generated, and a locked flag to prevent auto-updates from overwriting manual corrections.
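The actual app is built in Elixir and Phoenix, but the shape of that separation is easy to sketch in a few lines of SQL. This is an illustrative SQLite sketch, not the real schema; the source and locked columns come from the design above, while the other column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Source of truth: one row per dictionary word, with its linguistic metadata.
conn.execute("""
    CREATE TABLE vocabulary_words (
        id          INTEGER PRIMARY KEY,
        lemma       TEXT NOT NULL,   -- base form, e.g. 'pas'
        translation TEXT NOT NULL,   -- e.g. 'dog'
        gender      TEXT,            -- masculine / feminine / neuter
        animate     INTEGER          -- animacy affects accusative forms
    )
""")

# Derived search cache: one row per inflected form, rebuildable at any time.
conn.execute("""
    CREATE TABLE search_terms (
        form    TEXT NOT NULL,       -- e.g. 'psa', 'psu', 'psom'
        word_id INTEGER NOT NULL REFERENCES vocabulary_words(id),
        source  TEXT NOT NULL DEFAULT 'engine',  -- 'manual' | 'seed' | 'engine'
        locked  INTEGER NOT NULL DEFAULT 0       -- 1 = never auto-overwrite
    )
""")

# A search for any inflected form resolves back to the authoritative row.
conn.execute("INSERT INTO vocabulary_words VALUES (1, 'pas', 'dog', 'masculine', 1)")
conn.executemany(
    "INSERT INTO search_terms (form, word_id) VALUES (?, 1)",
    [("pas",), ("psa",), ("psu",), ("psom",)],
)
row = conn.execute("""
    SELECT w.lemma, w.translation FROM search_terms s
    JOIN vocabulary_words w ON w.id = s.word_id
    WHERE s.form = 'psa'
""").fetchone()
print(row)  # ('pas', 'dog')
```

The key property is in the last query: the cache row carries no meaning of its own, it only points back to the row that does.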
36 base words produced 411 searchable forms. The source of truth is small and manageable. The derived layer is large but disposable.
That's a clean little architecture for a personal project. But the reason it's clean is that I can point to exactly which table is authoritative for what. The vocabulary_words table owns linguistic metadata. The search_terms table owns nothing — it's a cache. That boundary is explicit, and everything follows from it.
In enterprise environments, those boundaries rarely exist. Not because sources of truth are missing, but because nobody has mapped which system is authoritative for which data.
What This Looks Like at Enterprise Scale
Consider a common scenario in any large organization that manages physical assets — infrastructure, equipment, facilities — anything that needs to be inspected, maintained, and tracked.
The canonical record for an asset almost never lives in one place. Customer and contract information sits in a CRM. Operational data — maintenance history, field status, inspection results — lives in a separate system, often managed by a different team. Legal ownership records might come from a third system, sometimes managed by an external partner. Scheduling, compliance, and financial data each have their own homes.
Nobody plans this. It's the natural result of different teams solving different problems at different times with different tools. And here's the thing that makes it deceptive: each of these systems is a source of truth — for its own slice of the data. The CRM is authoritative for customer and contract information. The operational database is authoritative for maintenance history and field status. The partner system is authoritative for legal ownership records.
The problem isn't that no source of truth exists. The problem is that no one has a map.
Nobody has written down: "For decommission status, the CRM wins. For last maintenance date, the ops database wins. For legal ownership, the partner system wins." That map — which system is authoritative for which question — doesn't exist. It lives informally in the heads of a few people who've been around long enough to know who to ask.
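The map itself can be embarrassingly small. A hypothetical sketch, with system and field names invented for illustration, of what "which system wins for which question" looks like once someone writes it down:

```python
# Authority map: for each question (field), exactly one system wins.
# These system and field names are illustrative, not from a real deployment.
AUTHORITY_MAP = {
    "decommission_status":   "crm",
    "contract_terms":        "crm",
    "last_maintenance_date": "ops_db",
    "field_condition":       "ops_db",
    "legal_owner":           "partner_system",
}

def resolve(field, answers):
    """Pick the answer from the system that is authoritative for this field.

    `answers` maps system name to that system's value for the field.
    Values from non-authoritative systems are ignored, not averaged.
    """
    authority = AUTHORITY_MAP.get(field)
    if authority is None:
        raise LookupError(f"No authority mapped for {field!r}; map it first")
    return answers[authority], authority

# The CRM and the ops database disagree about an asset's status:
value, system = resolve(
    "decommission_status",
    {"crm": "decommissioned", "ops_db": "active"},
)
print(value, system)  # decommissioned crm
```

Note what the function does when a field is unmapped: it refuses to answer. That is exactly the behavior an AI agent cannot have unless someone has done this mapping first.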
Now imagine someone in that organization says: "Let's build an AI agent that can answer questions about our assets."
The technical team starts building. RAG pipeline, embeddings, tool use, structured output — all the patterns that work beautifully in a demo. And then they hit the questions that expose the missing map:
"Which system do we query when data conflicts?" The CRM says an asset was decommissioned. The operational database says it had a maintenance visit last month. Both are "correct" within their own context. Without a map that says which system is authoritative for which question, the AI agent has no way to adjudicate. And neither does the team building it.
"Who is allowed to update what?" If the agent surfaces a data quality issue — which it will, because AI is embarrassingly good at finding inconsistencies — whose job is it to fix it? You can't answer that question without first answering: which system is the authority for this specific piece of data?
"What happens when we can't get access?" The system that holds the data the agent needs belongs to another team, another department, sometimes another organization entirely. Different security reviews, different approval processes, different data classification. Projects stall for weeks waiting on access requests — and the need for that access only becomes clear because no one mapped the data boundaries before the project started.
None of these are AI problems. They're data ownership problems. But they'll kill an AI project just the same.
The Organizational Problem Wearing a Technical Costume
When I built the two-table architecture for Ohmyword, I didn't just decide to have a source of truth. I decided which table was authoritative for what, and I made that decision explicit. The vocabulary_words table is authoritative for linguistic metadata. The search_terms table is authoritative for nothing — it's derived, it's rebuildable, and its role is clearly labeled. The decision was easy because I own both tables and there's no ambiguity.
At enterprise scale, the mapping work is harder but the questions are the same:
Who is the authority for each slice of data? Not "which database" — which person or team has the mandate to say "this is the correct record" for a specific type of information? If the answer is "it depends on who you ask," you don't have a missing source of truth — you have an unmapped one.
What's authoritative vs. what's derived? In my app, the search cache is explicitly derived. In enterprise environments, this distinction is rarely explicit. Teams build reports off derived data, make decisions based on cached aggregations, and treat downstream copies as if they're authoritative. When an AI agent joins the picture and starts reasoning across all of it, the inconsistencies become visible in ways they never were before.
What's the update contract? When the source of truth changes, what happens downstream? In my app, I can regenerate all 411 search terms from 36 base words in seconds. In a large organization, a change to the canonical asset record might need to propagate through five systems with five different update mechanisms — batch jobs, event streams, manual exports, and one system that just gets updated "when someone remembers."
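My app's update contract is simple enough to sketch: throw away everything engine-generated and rebuild it from the source of truth, keeping only the rows a human touched. The sketch below assumes that contract; the inflection rule is a deliberate toy (real Serbian declension handles fleeting vowels and much more), and the function name is mine, not the app's:

```python
def rebuild_search_terms(words, existing_terms):
    """Regenerate the derived cache from the source of truth.

    Manually entered or locked rows survive the rebuild; everything
    engine-generated is discarded and derived fresh. The toy rule
    below just appends case endings to the lemma.
    """
    kept = [t for t in existing_terms if t["source"] == "manual" or t["locked"]]
    generated = []
    for word in words:
        for ending in ("", "a", "u", "om"):  # toy stand-in for real declension
            generated.append({
                "form": word["lemma"] + ending,
                "word_id": word["id"],
                "source": "engine",
                "locked": False,
            })
    return kept + generated

words = [{"id": 1, "lemma": "pas"}]
# One manual correction exists (the real genitive 'psa', which the toy rule
# would get wrong), and it must survive the rebuild:
old = [{"form": "psa", "word_id": 1, "source": "manual", "locked": True}]
terms = rebuild_search_terms(words, old)
print(len(terms))  # 5: one manual correction kept, four forms regenerated
```

The point is not the declension logic. It is that the rebuild is cheap, total, and safe to run at any time, which is exactly what most enterprise propagation pipelines are not.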
Who resolves conflicts? My app doesn't have conflicts because the authority map is simple: one table owns the data, the other derives from it. Enterprise systems have conflicts constantly — not because the data is wrong, but because two systems are both "right" for different questions, and nobody has documented which system wins for which question. The resolution process is usually informal, undocumented, and different depending on who you ask.
The AI project doesn't create these problems. It reveals them. And it reveals them at the worst possible time — in production, in front of users, when the agent confidently presents an answer assembled from systems that disagree with each other.
Why This Matters Before the First Line of Agent Code
It's common for teams to spend months building a solid AI agent — careful prompt engineering, a strong eval framework, reliable guardrails — only to have the deployment stall because the data layer underneath is unmapped.

The agent works perfectly in a demo environment where everyone agreed to use clean, unified test data. Then it hits production, where the CRM hasn't been reconciled with the operational database in months, and the results are wrong in ways that erode trust fast. Not because the data is bad, but because the agent doesn't know — and can't know — which system to trust for which question.
The lesson I keep coming back to, from building a language app and from thinking about this problem more broadly: the AI agent is only as trustworthy as the data authority map underneath it. If you can't point to which system is authoritative for which data, you don't have a technical problem to solve. You have a discovery problem to solve first.
And that discovery work — mapping which systems own which data, building consensus on authority boundaries, establishing update contracts between teams — that's not engineering work. It's delivery work. It's stakeholder alignment work. It's the work that happens in conference rooms and Slack threads and escalation emails before anyone writes a prompt.
The good news is that mapping is a fundamentally different ask than building. "Build a unified source of truth" is a massive infrastructure project that most organizations will never finish. "Map which systems are authoritative for which data" is a discovery exercise — still hard, politically and logistically, but scoped and achievable as a precursor to an AI deployment.
The Pattern I'd Use Again
The two-table pattern from Ohmyword scales conceptually, even if the implementation looks completely different at enterprise scale:
- Map the authority boundaries explicitly. Don't just name the system — name what it's authoritative for. "The CRM is the source of truth for customer contracts and decommission status. The ops database is the source of truth for maintenance history and field condition." Make it boring and unambiguous.
- Treat everything else as derived. If a downstream system has a copy, it's a cache. Label it as such. Build refresh mechanisms. Don't let anyone treat it as the primary record for data that belongs to another system.
- Track provenance. My search terms have a source field. Enterprise data should have the equivalent — where did this record come from, when was it last synced, and how confident are we in it?
- Build the escape hatch. My locked flag prevents automated processes from overwriting manual corrections. At scale, you need the equivalent: a way for human experts to override the machine when the data is wrong, without breaking the automated pipeline.
- Document the map before building the agent. If you can't produce a simple document that says "for question X, system Y is authoritative," you're not ready to build an AI agent that reasons across those systems. Do the mapping work first.
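The escape hatch in the last two bullets reduces to one guard in the sync path. A minimal sketch of that guard, with function and field names that are mine, not from any particular system:

```python
def apply_auto_update(record, new_value):
    """Automated sync that respects human overrides.

    Locked records are returned unchanged rather than overwritten,
    so one manual correction never breaks the rest of the pipeline.
    """
    if record.get("locked"):
        return record  # a human expert corrected this; the machine yields
    return {**record, "value": new_value, "source": "engine"}

# A manual correction the engine would otherwise clobber, and a stale
# engine-generated row that should be refreshed:
corrected = {"value": "psa",  "source": "manual", "locked": True}
stale     = {"value": "pasa", "source": "engine", "locked": False}

a = apply_auto_update(corrected, "pasa")
b = apply_auto_update(stale, "psa")
print(a["value"], b["value"])  # psa psa
```

Provenance makes the same guard auditable: because every record says where it came from, you can tell at a glance which rows the machine is allowed to touch.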
None of this requires AI expertise. It requires the willingness to do discovery work and have difficult conversations about data ownership before the exciting technical work begins.
Coming Full Circle
I built Ohmyword's two-table architecture because Serbian grammar forced me to think clearly about what's authoritative and what's derived. 36 words, 411 forms — the complexity demanded a clean separation and an explicit map of which layer owns what.
Enterprise organizations face the same underlying challenge at a much larger scale, with the added dimension that the "tables" are owned by different teams, governed by different policies, and sometimes maintained by entirely different organizations. The data authorities exist — they're just unmapped.
In both cases, the insight is the same: if you don't know which system is authoritative for which data, every layer you build on top — including your AI agent — inherits the confusion.
The best technical architecture in the world can't compensate for an organization that hasn't mapped who owns what.
This is the second in a series of posts about building Ohmyword and what small projects reveal about large ones. Previously: Why I Chose Elixir and Phoenix to Build a Serbian Language Learning App. You can check out the app itself here: Ohmyword