Why AI Startups Are Taking Data Into Their Own Hands


For one week this summer, Taylor and her roommate wore GoPro cameras strapped to their foreheads as they painted, sculpted, and did household chores. They were training an AI vision model, carefully syncing their footage so the system could get multiple angles on the same behavior. It was hard work in many ways, but they were well paid for it, and it allowed Taylor to spend most of her day making art.

“We woke up, did our daily routine, and then strapped the cameras on our heads and synced the times together,” she told me. “Then we would make our meal and clean the dishes. Then we’d go our separate ways and work on art.”

They were hired to produce five hours of synced footage each day, but Taylor quickly learned she needed to allot seven hours a day for the work, to leave enough time for breaks and physical recovery.

“It would give you headaches,” she said. “You take it off and there’s just a red square on your forehead.”

Taylor, who asked not to give her last name, was working as a data freelancer for Turing Labs, an AI company which connected her to TechCrunch. Turing’s goal wasn’t to teach the AI how to make oil paintings, but to gain more abstract skills around sequential problem-solving and visual reasoning. Unlike a large language model, Turing’s vision model would be trained entirely on video, and most of it would be collected directly by Turing.

Alongside artists like Taylor, Turing is contracting with chefs, construction workers, and electricians: anyone who works with their hands. Turing Chief AGI Officer Sudarshan Sivaraman told TechCrunch the manual collection is the only way to get a sufficiently varied dataset.

“We are doing it for so many different kinds of blue-collar work, so that we have a diversity of data in the pre-training phase,” Sivaraman told TechCrunch. “After we capture all this data, the models will be able to understand how a certain task is performed.”


Turing’s work on vision models is part of a growing shift in how AI companies deal with data. Where training sets were once scraped freely from the web or collected from low-paid annotators, companies are now paying top dollar for carefully curated data.

With the raw power of AI already established, companies are looking to proprietary training data as a competitive advantage. And instead of farming out the task to contractors, they’re often taking on the work themselves.

The email company Fyxer, which uses AI models to sort emails and draft replies, is one example.

After some early experiments, founder Richard Hollingsworth discovered the best approach was to use an array of small models with tightly focused training data. Unlike Turing, Fyxer is building off someone else’s foundation model, but the underlying insight is the same.

“We realized that the quality of the data, not the quantity, is the thing that really defines the performance,” Hollingsworth told me.

In practical terms, that meant some unconventional staffing choices. In the early days, Fyxer engineers and managers were sometimes outnumbered four-to-one by the executive assistants needed to train the model, Hollingsworth says.

“We used a lot of experienced executive assistants, because we needed to train on the fundamentals of whether an email should be responded to,” he told TechCrunch. “It’s a very people-oriented problem. Finding great people is very hard.”

The pace of data collection never slowed down, but over time Hollingsworth became more precious about the data sets, preferring smaller, more tightly curated datasets when it came time for post-training. As he puts it, “the quality of the data, not the quantity, is the thing that really defines the performance.”

That’s particularly true when synthetic data is used, which magnifies both the scope of possible training scenarios and the impact of any flaws in the original dataset. On the vision side, Turing estimates that 75 to 80 percent of its data is synthetic, extrapolated from the original GoPro videos. But that makes it even more important to keep the original dataset as high-quality as possible.

“If the pre-training data itself is not of good quality, then whatever you do with synthetic data is also not going to be of good quality,” Sivaraman says.

Beyond concerns of quality, there’s a powerful competitive logic behind keeping data collection in-house. For Fyxer, the hard work of data collection is one of the best moats the company has against competition. As Hollingsworth sees it, anyone can build an open-source model into their product, but not everyone can find expert annotators to train it into a workable product.

“We believe that the best way to do it is through data,” he told TechCrunch, “through building custom models, through high-quality, human-led data training.”
