Generative AI has changed the very nature of how we think about data

Data has been at the core of technology for some time, but generative AI really pushed it to the heart of the discussion and changed how we think about the importance of data in a fundamental way.
As I wrote in my ForwardThinking commentary recently, Garbage in/garbage out applies more than ever when it comes to generative AI:
Whenever I talk to people who are working with enterprise customers about AI, they inevitably need to talk about data first. That’s because the quality of the models depends on the quality of the data used to train them. The phrase garbage in, garbage out goes back decades for a reason.
Box CEO Aaron Levie says that asking the right questions about your data is a big part of building an AI technology stack. “Is your data ready and in a format and in an environment that lets you take advantage of AI?” Levie asked.
Teresa Tung, senior managing director in charge of AI and data products at Accenture, agrees. Her firm has worked with thousands of companies implementing AI projects, and she says that almost half of the companies Accenture speaks to don’t have data that’s ready for generative AI. “We have found that 48% of companies who want to do generative AI or AI projects, say they can't because their data is not ready,” Tung told FastForward.
She says that the data discussion has become so critical that Accenture is now having it with higher-level executives. Whereas she previously would have spoken to a data officer or perhaps a CTO or CIO, today she is having those conversations with the CEO because of how important data has become to every organization. “It's raised the agenda and lifted it beyond a tech project or tech subject to strategic discussion,” she said.
Are you ready?
A recent survey by Kearney and Futurum of 200 CEOs found that “nearly two-thirds of CEOs cite disconnected or low-quality data as the main barrier preventing AI solutions from scaling beyond pilot phases, underscoring how critical robust data readiness is to any AI initiative.”
The other issue is the kinds of data we’re using to train AI models, and it’s not just structured data sitting in neat tables in databases. It’s also unstructured data that’s often messy, where it’s hard to identify the definitive document you want the model to use as the source of truth.
Part of the problem, Tung says, is that we have systems in place to deal with structured data, and we have to reproduce those systems with unstructured data. That includes a data model, a data pipeline and experience dealing with that kind of data in production at scale. When it comes to unstructured data, most companies are not equipped to build an equivalent data stack.
“As we pivoted from structured data to now thinking about unstructured data, we have to go back and think about whether we have the equivalent data supply chain,” Tung said. She added that you also have to govern and curate that unstructured data. Is ‘Presentation Final V2’ the actual final version of the presentation? If it’s not, and you’re using it to train your model, you might not get accurate results.
Even if the data is in shape to train the model, you need enough of it, and for some companies, that’s where the problem lies. “You might not even have data to validate your models, so not even fine tuning a model, but just validating this model. Is it accurate for what I need?” Tung said. In these cases, she says, you might want to consider synthetic data. “We've used synthetic data [when required], but that muscle isn't built across the data industry,” she said.
In its Data Readiness report, published last fall, Accenture found, “Synthetic data—created by algorithms, rather than collected from actual life events—addresses the growing scarcity of specialized datasets, enabling companies to explore multiple scenarios without the extensive costs associated with collecting traditional data.”
As companies attempt to take advantage of generative AI, they need to understand that having the proper amount of data in the correct format with the right governance is job one. Until they get there, they can’t take full advantage of the technology.