CTO AI Corner: How much does data quality matter for AI services?

For a long time in machine learning, one truth has been repeated almost like a mantra: data quality is everything. In many cases, the first step in any project has been to clean and structure the data before even attempting a solution. But I am starting to question whether that still holds with modern general-purpose AI models.

AI cannot conjure information that is not in the data, but it can handle messy inputs surprisingly well. It can work with incomplete context, inconsistent formats, and even partially incorrect data in ways that older approaches simply could not.

My previous mental model was that AI could at best match what a human could infer from the same dataset. In other words, if you could fully document a human expert’s tacit knowledge, AI might reach that expert’s level, but not go beyond it. I am gradually revising that assumption.

Rethinking the role of data quality

General-purpose AI brings broad contextual knowledge that often goes beyond a single human’s expertise. For example, terminology differences between industries can confuse people who are deeply specialized in one domain. AI, on the other hand, does not really care about those boundaries. It can often infer meaning across contexts without getting stuck on unfamiliar phrasing.

Then there is patience. If guided properly, AI does not get tired or lose focus. Where a human might start skimming by page five, AI will stay just as attentive on page five hundred. That alone can reduce missed details.

It also performs well in areas where humans rely on educated guesswork. Interpreting misspelled names, identifying misplaced information, or reconstructing intent from imperfect data are all tasks where AI can be surprisingly effective.

So, my updated view is this: with proper instructions, AI can often perform at least as well as a human when working with messy data, and sometimes even better.

That raises an interesting question. If humans can already manage the process with imperfect data, do we really need to prioritize cleaning it before trying to automate?

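Improving data quality is still valuable, especially when it is easy to do. But it may no longer need to be the automatic first step. It might be more efficient to first test whether data quality actually limits the outcome, for example by running the AI on a small sample of the data as it is and checking the results before investing in cleanup.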
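One way to run that kind of test is to compare results on a small sample in two versions: the data as it is, and the same records cleaned by hand. The sketch below is only an illustration under stated assumptions. The names `accuracy`, `data_quality_gap`, and `toy_predict` are hypothetical, and `toy_predict` is a trivial stand-in for whatever AI service you would actually call.

```python
from typing import Callable, Sequence


def accuracy(
    predict: Callable[[str], str],
    inputs: Sequence[str],
    expected: Sequence[str],
) -> float:
    """Fraction of inputs for which predict() returns the expected answer."""
    if len(inputs) != len(expected):
        raise ValueError("inputs and expected must have the same length")
    if not inputs:
        raise ValueError("need at least one example")
    hits = sum(predict(x) == y for x, y in zip(inputs, expected))
    return hits / len(inputs)


def data_quality_gap(
    predict: Callable[[str], str],
    raw_inputs: Sequence[str],
    cleaned_inputs: Sequence[str],
    expected: Sequence[str],
) -> float:
    """Accuracy on hand-cleaned data minus accuracy on the raw data.

    A gap near zero suggests data quality is not what limits the outcome.
    """
    return accuracy(predict, cleaned_inputs, expected) - accuracy(
        predict, raw_inputs, expected
    )


# Hypothetical stand-in for the AI service: it only trims and lowercases.
def toy_predict(text: str) -> str:
    return text.strip().lower()


# The same three records, raw and hand-cleaned (illustrative data only).
raw = ["  Helsinki", "TAMPERE ", "Turkuu"]
cleaned = ["Helsinki", "Tampere", "Turku"]
expected = ["helsinki", "tampere", "turku"]

gap = data_quality_gap(toy_predict, raw, cleaned, expected)
print(f"Accuracy gained by cleaning: {gap:.2f}")
```

If the gap is small on a representative sample, cleaning can probably wait. If it is large, the sample also shows which kinds of errors are worth fixing first. What counts as "small" depends on the use case and is a judgment call, not something this sketch decides.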

May 6, 2026
ai-corner
Authors
Tomi Leppälahti
CAIO & CTO