Companies spend a lot of money on bad data, especially when that data powers AI. How much are we talking about? A 2024 survey by data integration company Fivetran found that artificial intelligence (AI) trained on inaccurate, incomplete, or low-quality data can cost large businesses 6% of their revenue, or an average of $406 million annually.
“If you don’t have good data, you’re probably not going to be making the best business decisions,” said Karim Habbal, vice president, data management solutions, at Salesforce. “There’s a real impact on both the day-to-day tactical decisions and the multiyear decisions being made.”
With that big a bite out of revenue, you’d think leaders would rush to get going on data cleaning. But time, labor, and tech tools are expensive, and some companies don’t want to make the investment. That’s shortsighted. Even a modest investment can yield rewards later. Our list of five cost-effective ways to clean data can get you started.
The cost of bad data
Bad data harms both reputations and financial results, as numerous major brands have discovered. One major airline ended up in court after its chatbot explained its bereavement travel policy incorrectly, telling a customer he was eligible for a refund when he was not. Elsewhere, a data error in an automated air traffic control system canceled 2,000 flights in the U.K. and Ireland, leaving thousands of travelers stranded and airlines suffering as much as $135 million in losses.
The costs may also be less obvious. A minor typo in a customer’s address could lead to missed communications, missed deliveries, and lost sales. And customer trust is too valuable to be measured in dollars. Customers may take their business elsewhere if an AI agent hallucinates or answers questions incorrectly. They won’t care that the AI made the mistake; they’ll only remember that it was your company’s.
How to clean your data in a cost-effective way
Your AI agent will only be as good as the data you feed it. But it may be easier — and less expensive — to get your data ready than you think. Here’s how:
1. Prioritize which data needs to be cleaned
Start by cleaning only the data your agent needs.
Salesforce does this with its own agents, which are powered by Agentforce, the company’s platform for building and deploying AI agents. When building an agent, the product team focuses on the task or tasks they want it to perform. “Those jobs are called ‘topics,’ and the topics are a way of routing a user query to a specific thing the agent can do,” said Daniel Zielaski, vice president, data science, at Salesforce. Once the product team has identified a topic, they build a “corpus,” which is the knowledge base an agent needs to carry out its task.
Zielaski pointed to Salesforce’s new sales development representative (SDR) agent as an example. When writing outreach emails to prospects, the SDR agent needs account, lead, and contact information that is clean and up to date. But it doesn’t need information on how to solve a tech problem. “We identify the data that will be consumed by a specific topic, and then we focus on improving its overall quality, versus boiling the ocean and trying to clean all our data,” he said.
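To make the idea concrete, here’s a minimal sketch of topic-based prioritization. This is an illustration only, not Agentforce’s actual implementation; the topic names and data sources are hypothetical.

```python
# Hypothetical mapping of agent topics to the data each one actually consumes.
# Only the sources behind the topics you deploy need to be cleaned first.
TOPIC_CORPUS = {
    "prospect_outreach": ["accounts", "leads", "contacts"],
    "order_status": ["orders", "shipments"],
}

def data_to_clean(deployed_topics: list[str]) -> set[str]:
    """Return the data sources your deployed topics depend on, skipping everything else."""
    sources = set()
    for topic in deployed_topics:
        sources.update(TOPIC_CORPUS.get(topic, []))
    return sources

# Cleaning only what an outreach-style topic needs, rather than "boiling the ocean"
print(data_to_clean(["prospect_outreach"]))   # {'accounts', 'leads', 'contacts'}
```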
2. Manage your labor costs
For many companies, the largest data-related cost is labor. For instance, the average annual salary for a data engineer in San Francisco is $178,000, and when you establish an entire in-house data team, the expense of salaries, benefits, and training can quickly mount up. Internal teams are crucial for handling sensitive data like health or financial information, and they provide institutional knowledge and continuity. But for less sensitive data, you could use an outside provider or freelancers, which would allow you to pay only for the services you need. Or you could take a hybrid approach and use a combination of both.
You can also use Salesforce’s Data Cloud, which solves one of the biggest problems companies face: pulling data from different software systems into one place for an AI agent to read. “The product has been designed so that you don’t have to pay for a large data engineering team,” said Zielaski. “You don’t have to pay for an architecture team. You don’t have to pay a group of people to go in and use code to move data from one place to another.”
3. Automate as much data cleaning as possible
The Fivetran survey found that data scientists spend most of their time (67%) preparing data rather than building and refining AI models. But there’s a way to lighten their load: Automate your data quality processes.
Automating data quality processes, whether through code or dedicated data quality tools, can drastically reduce the time required to monitor and clean data. Yes, it requires an upfront investment. But a Forrester report found that data quality tools catch issues sooner, improving resolution time by 90% and saving 5,184 data engineer hours.
They do this partly by detecting anomalies. Habbal’s team, for example, uses various data quality tools to automatically profile data sets, including those that calculate annual contract value (ACV), a critical financial metric. He offered a hypothetical data set in which the typical ACV for each customer ranged from $10 million to $50 million. If the data quality tool discovers an ACV of $30, Habbal said, “we’re then alerted, and can investigate it.”
Habbal’s team also uses these tools to monitor data for completeness, timeliness, accuracy, and conformity. “Basically, what that means is, I can create a rule that says, ‘Trigger an alert when the completeness of the data falls below 99%’,” he said.
Why is this important?
“We don’t want to give [Salesforce CEO Marc Benioff] a data set that’s only 90% complete if we were reporting the quarterly ACV to him,” Habbal said. “For that situation, we’d have a very high threshold with our data quality tool, that the data needs to be 99% complete or greater.”
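Here’s a minimal sketch of what rules like these can look like in code, assuming a simple pandas DataFrame and the hypothetical thresholds described above (the $10 million to $50 million ACV range and the 99% completeness bar). Dedicated data quality tools implement these checks with far more sophistication.

```python
import pandas as pd

# Hypothetical thresholds based on the examples above
ACV_MIN, ACV_MAX = 10_000_000, 50_000_000   # expected ACV range per customer
COMPLETENESS_THRESHOLD = 0.99               # 99% of values must be present

def check_data_quality(df: pd.DataFrame) -> list[str]:
    """Return a list of alert messages for anomalies and incomplete columns."""
    alerts = []

    # Anomaly check: flag ACV values far outside the expected range
    outliers = df[(df["acv"] < ACV_MIN) | (df["acv"] > ACV_MAX)]
    for _, row in outliers.iterrows():
        alerts.append(f"Anomalous ACV for account {row['account_id']}: ${row['acv']:,.2f}")

    # Completeness check: flag columns with too many missing values
    for column in df.columns:
        completeness = df[column].notna().mean()
        if completeness < COMPLETENESS_THRESHOLD:
            alerts.append(f"Column '{column}' is only {completeness:.1%} complete")

    return alerts

# Example usage with a tiny sample data set
sample = pd.DataFrame({
    "account_id": ["A1", "A2", "A3"],
    "acv": [12_000_000, 30.0, 45_000_000],   # the $30 entry should trigger an alert
    "region": ["AMER", None, "EMEA"],
})
for alert in check_data_quality(sample):
    print(alert)
```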
4. Put a data governance policy in place
Another way to reduce costs is to establish clear governance that incorporates data stewardship. In other words, spell out who’s responsible for a specific set of data.
Consider the hypothetical example of data created in a business application. As the data moves downstream for analytical or reporting use cases, it might be replicated four times. When someone discovers there’s an issue with the data, “we don’t want four different teams to remediate their copies of the data,” said Habbal. If you have clear ownership of the data, only one team will be responsible, which means fewer labor costs.
A governance policy that outlines your stance on access, security, and compliance also protects you against risk. Errors in financial reporting or the improper handling of personal data can lead to costly fines and legal battles. And compliance issues drain resources, too. Clear governance lessens these risks.
5. Use AI to prevent bad data in the first place
In 1992, George Labovitz and Yu Sang Chang, then both professors at the Boston University School of Management, introduced the 1:10:100 rule of data quality. Their rule asserts:
The cost of preventing poor data quality at the source is $1 per record.
The cost of remediation after a data quality issue has been identified is $10.
The cost of doing nothing is $100.
Those numbers have likely changed over the years, but the idea is the same: One of the best ways to save money is to prevent bad data from entering your system in the first place. AI can help, and Zielaski pointed to Salesforce’s SDR agent as a good example. A lead is created when a potential customer visits Salesforce’s website and submits a form. But the form needs to be filled out in a specific way to create standardized, well-formatted data. If a prospect enters a phone number with an extra digit, they’re required to re-enter it. And if a required field is left blank, they can’t click the Submit button.
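As an illustration only, and not Salesforce’s actual form logic, here’s a minimal sketch of the kind of validation that keeps malformed entries out of a lead form. The field names and the 10-digit phone rule are assumptions.

```python
import re

def validate_lead_form(form: dict) -> dict:
    """Return a dict of field-level error messages; an empty dict means the form is clean."""
    errors = {}

    # Required fields must not be left blank
    for field in ("first_name", "last_name", "company", "email", "phone"):
        if not form.get(field, "").strip():
            errors[field] = "This field is required."

    # Phone numbers must contain exactly 10 digits (a hypothetical rule);
    # an extra digit forces the prospect to re-enter the number
    digits = re.sub(r"\D", "", form.get("phone", ""))
    if digits and len(digits) != 10:
        errors["phone"] = "Please enter a 10-digit phone number."

    # Basic email format check
    if form.get("email") and not re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", form["email"]):
        errors["email"] = "Please enter a valid email address."

    return errors
```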
Preventing bad data gets even more challenging when a company goes by different names. Japan’s All Nippon Airways, for example, is often called ANA. If the airline’s employees fill out Salesforce website forms using different company names at different times, duplicate accounts will be created — and Salesforce might send redundant outreach emails. To avoid this, a Salesforce team builds AI algorithms that de-duplicate entries and scrub the data to make sure it’s pristine. “Think of the algorithms like vacuum cleaners that are constantly fixing up all that data,” said Zielaski.
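For a sense of how name-based de-duplication works, here’s a minimal sketch using simple normalization and fuzzy string matching from Python’s standard library. It is not Salesforce’s algorithm; the alias table, suffix list, and similarity threshold are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical alias table; real pipelines would curate or learn these mappings
KNOWN_ALIASES = {"ana": "all nippon airways"}

def normalize(name: str) -> str:
    """Lowercase, strip punctuation and common corporate suffixes, and expand known aliases."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()
    for suffix in (" inc", " ltd", " co", " corp"):
        if cleaned.endswith(suffix):
            cleaned = cleaned[: -len(suffix)].strip()
    return KNOWN_ALIASES.get(cleaned, cleaned)

def likely_duplicates(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    """Treat two account names as duplicates if their normalized forms are near-identical."""
    a, b = normalize(name_a), normalize(name_b)
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold

# Example: these entries would collapse into one account
print(likely_duplicates("ANA", "All Nippon Airways"))                            # True (alias match)
print(likely_duplicates("All Nippon Airways Co., Ltd.", "All Nippon Airways"))   # True
```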
The revenue generated by the SDR agent offsets the cost of this team. “If you’re building an autonomous agent that can generate a pipeline of hundreds of millions of dollars, and the only thing you’ve got to do is build a five to 10 person team to manage data quality,” Zielaski said, “find a CEO that isn’t willing to make that investment.”
Data cleaning is worth every penny
Prepping your data for AI can feel daunting and expensive. But you can make the CEO and CFO happy if you break the task down, clean only the data you need, and carefully allocate resources. It’s an investment you won’t regret.