Data is the fuel of artificial intelligence. It is also a bottleneck for big businesses, because they are reluctant to fully embrace the technology without knowing more about the data used to build AI programs.

Now a consortium of companies has developed standards for describing the origin, history and legal rights to data. The standards are essentially a labeling system for where, when and how data was collected and generated, as well as its intended use and restrictions.

The data provenance standards, announced Thursday, have been developed by the Data and Trust Alliance, a non-profit group made up of two dozen mainly large companies and organizations, including American Express, Humana, IBM, Pfizer, UPS and Walmart, as well as a few startups.

The alliance members believe the data-labeling system will be similar to fundamental food-safety standards, which require basic information such as where food came from, who produced and grew it, and who handled it on its way to a grocery shelf.

Greater clarity and more information about the data used in AI models, executives say, will bolster corporate confidence in the technology. How widely the proposed standards will be used is uncertain, and much will depend on how easy the standards are to apply and automate. But standards have accelerated the use of every significant technology, from electricity to the internet.

“This is a step toward managing data as an asset, which is what everyone in industry is trying to do today,” said Ken Finnerty, president for information technology and data analytics at UPS. “To do that, you have to know where the data was created, under what circumstances, its intended purpose and where it’s legal to use or not.”

Surveys point to the need for greater confidence in data and for improved efficiency in data handling. In one poll of corporate CEOs, a majority cited “concerns about data lineage or provenance” as a key barrier to AI adoption. And a survey of data scientists found that they spent nearly 40 per cent of their time on data preparation tasks.

The data initiative is mainly intended for business data that companies use to make their own AI programs or data they may selectively feed into AI systems from companies like Google, OpenAI, Microsoft and Anthropic. The more accurate and trustworthy the data, the more reliable the AI-generated answers.

For years, companies have been using AI in applications that range from tailoring product recommendations to predicting when jet engines will need maintenance.

But the rise in the past year of the so-called generative AI that powers chatbots like OpenAI’s ChatGPT has heightened concerns about the use and misuse of data. These systems can generate text and computer code with humanlike fluency, yet they often make things up – “hallucinate,” as researchers put it – depending on the data they access and assemble.

Companies do not typically allow their workers to freely use the consumer versions of the chatbots. But they are using their own data in pilot projects that use the generative capabilities of the AI systems to help write business reports, presentations and computer code. And that corporate data can come from many sources, including customers, suppliers, and weather and location feeds.

“The secret sauce is not the model,” said Rob Thomas, IBM’s senior vice-president of software. “It’s the data.”

In the new system, there are eight basic standards, including lineage, source, legal rights, data type and generation method. Then there are more detailed descriptions for most of the standards – noting, for example, that the data came from social media or industrial sensors.

The data documentation can be done in a variety of widely used technical formats. Companies in the data consortium have been testing the standards to improve and refine them, and the plan is to make them available to the public early next year.
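To make the idea concrete, here is a minimal sketch of what such a provenance label might look like when serialized to JSON, one of the widely used formats this kind of metadata could travel in alongside a dataset. The field names and values below are illustrative assumptions based on the categories the article names (lineage, source, legal rights, data type, generation method, intended use and restrictions), not the alliance's official schema.

```python
import json

# Hypothetical provenance label. The keys mirror the categories the
# consortium describes; they are assumptions for illustration, not the
# Data and Trust Alliance's published field names.
provenance_label = {
    "lineage": "aggregated from point-of-sale systems, 2021-2023",
    "source": "internal retail transactions",
    "legal_rights": "licensed for internal analytics; no resale",
    "data_type": "structured tabular records",
    "generation_method": "point-of-sale terminals and inventory sensors",
    "intended_use": "training demand-forecasting models",
    "restrictions": "no use of personally identifiable fields",
}

# Serialize the label so it can accompany the dataset it describes.
label_json = json.dumps(provenance_label, indent=2)
print(label_json)
```

A label like this could be attached to each dataset as it moves between teams or vendors, so a downstream data scientist can check its origin and permitted uses without re-negotiating with the original supplier.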

Labeling data by type, date and source has been done by individual companies and industries. But the consortium said these are the first detailed standards meant to be used across all industries.

“My whole life I’ve spent drowning in data and trying to figure out what I can use and what is accurate,” said Thi Montalvo, a data scientist and vice-president of reporting and analytics at Transcarent.

Transcarent, a member of the data consortium, is a startup that relies on data analysis and machine-learning models to personalize health care and speed payment to providers.

The benefit of the data standards, Ms. Montalvo said, comes from greater transparency for everyone in the data supply chain. That workflow often begins with negotiating contracts with insurers for access to claims data and continues with the startup’s data scientists, statisticians and health economists who build predictive models to guide treatment for patients.

At each stage, knowing more about the data sooner should increase efficiency and eliminate repetitive work, potentially reducing the time spent on data projects by 15 per cent to 20 per cent, Ms. Montalvo estimates.

The data consortium said the AI market today needs the clarity the group’s data-labeling standards can provide. “This can help solve some of the problems in AI that everyone is talking about,” said Chris Hazard, a co-founder and the chief technology officer of Howso, a startup that makes data-analysis tools and AI software.