Agents Are Creating New Data Sets. Now What?

Why agentic data generation will change the downstream tool stack.

Ellie Fields

January 29, 2026

Future of Data

Agents can now generate completely new data sets. How does that change things? 

To explore this question, I used Parallel.ai’s FindAll tool to assemble a dataset of all parks within 100 miles of Seattle. Park-finding is a classic data federation problem: there are park finders at various state and local sites, but no single park finder that covers them all.

Here’s a walkthrough of that flow. The full process is at the end of the post. 

What doesn’t change with agentic data generation 

Even when agents generate the data, the fundamentals of how humans evaluate data stay the same.

You still need to ask: 

  • What’s in there? I.e., what’s the shape and scope of the data?
  • Do I trust it? I.e., does it seem accurate, given what I already know?
  • Do I care? I.e., will this data answer my question? How hard will it be to get the answer?

The first question, “What’s in there?”, is even more relevant when agents gather the data. If you’re pricing bananas and you accidentally prompt for data about bandanas, your whole strategy may (should?) fail. It seems like a trivial example, but I've seen the equivalent happen without any agents involved.

Likewise, “Do I trust it?” is most relevant when working with a new data set, and most agent-generated data sets are, by definition, new.

What does change

What does change is the tempo. Agent-generated data will create pressure for flexible (and likely agentic) tools that let people work with data as soon as it's generated. Research has shown time and again that people can understand data much faster when it's organized and aggregated visually. In the parks data, charting parks by size immediately showed me that there were several duplicates. As I explored, I could ask questions using the Data Agent to learn more (“What parks are in ‘Other’?”).

This is where AI comes in. We’ve already seen how the ability to ask natural-language questions of data means more flexibility and less need for bespoke dashboards. AI is going to have to shoulder the load of bringing data into a visual, queryable format so that humans can do the sense-making we need to do.

I mean, we’ve known for decades that it takes too long to build dashboards (the primary way to make data visual), and that inhibits people’s ability to think with and use data. Soon, anything longer than a few minutes will be “too long.” 

Agents can do more of the work, but humans still need to understand the data.

Does it matter? 

“Yes,” you may say, “but I’m not searching the web for my data. It still comes from traditional pipelines.”

True. But agentic workflows are not only for the web: agents are emerging to make it easier to use all that enterprise data. For example, Bobsled helps product companies “build agentic experiences on complex data.” Spice AI provides fast and federated access to enterprise data to allow agents to access it. 

Agents that can gather data on the fly hold the promise of unlocking value in that data. And that will drive the rest of the data stack to be more interoperable and agentic as well. 

Appendix: How I generated the data, and some observations.

What I did: 

  • I started a FindAll search with this prompt: “Find all parks within 100 miles of seattle. Please list Park name, Park location (city, state), Size in Acres, Park owner/ administrator (Example: Washington State Parks, Bellevue Parks, National Park, etc) and major features.”
  • I ran the search several times and got a variety of results. I ended up extending my original search with another 250 entities because I knew the first result set was incomplete.
  • Although “size in acres” and “park owner/ administrator” were part of the original query, the data in them was not well formed enough to use in analysis. I added an enrichment for both fields, specifying that the first should be numeric and the second constrained to a list. 
  • The only data manipulation I did was to cast the “number of acres” field to a number, because it came back as a string in the Parallel data set, even after the enrichment step.
  • The result set of 389 parks is pretty good but has some obvious errors, mostly duplication (two of Rainier National Park, and “Washington State Parks” being named a park instead of a park system, for example). 
  • From a cost perspective, Parallel always estimated the search cost as much higher than it ultimately was, occasionally 10x higher. This was consistent across different prompts and runs. 
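
The cast and the duplicate check described above can be sketched in a few lines of Python. This is a minimal sketch, not what I actually ran: the row layout, the field names (name, acres, owner), and the sample values are all invented for illustration, and the duplicate check only catches names that differ in case or spacing.

```python
import re
from collections import defaultdict

# Hypothetical rows; field names and values are invented for illustration
# and are not taken from the actual Parallel result set.
rows = [
    {"name": "Rainier National Park", "acres": "1,000", "owner": "National Park"},
    {"name": "rainier  national park ", "acres": "1000 acres", "owner": "National Park"},
    {"name": "Example City Park", "acres": "unknown", "owner": "Bellevue Parks"},
]

def parse_acres(raw):
    """Return the first number found in a string as a float, or None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw or "")
    if match is None:
        return None
    return float(match.group().replace(",", ""))

def normalize_name(name):
    """Lowercase a park name and collapse runs of whitespace."""
    return " ".join(name.lower().split())

def find_duplicates(records):
    """Group records by normalized name and keep only groups with more than one record."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_name(record["name"])].append(record)
    return {key: group for key, group in groups.items() if len(group) > 1}

# Add a numeric column alongside the original string so nothing is lost.
for row in rows:
    row["acres_num"] = parse_acres(row["acres"])

duplicates = find_duplicates(rows)
```

Keeping the original string next to the parsed number makes it easy to review the rows where parsing failed (the None values) instead of silently dropping them. Catching a park system listed as a park, like “Washington State Parks,” would need a separate check.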
