ANOTHER WEEK, ANOTHER record-breaking AI research study released by Google—this time with results that are a reminder of a crucial business dynamic of the current AI boom. The ecosystem of tech companies that consumers and the economy increasingly depend on is traditionally said to be kept innovative and un-monopolistic by disruption, the process whereby smaller companies upend larger ones. But when competition in tech depends on machine learning systems powered by huge stockpiles of data, slaying a tech giant may be harder than ever.
Google’s new paper, released as a preprint Monday, describes an expensive collaboration with Carnegie Mellon University. Their experiments on image recognition tied up 50 powerful graphics processors for two solid months, and used an unprecedentedly huge collection of 300 million labeled images (much work in image recognition uses a standard collection of just 1 million images). The project was designed to test whether it’s possible to get more accurate image recognition not by tweaking the design of existing algorithms but just by feeding them much, much more data.
The answer was yes. After Google and CMU’s researchers trained a standard image processing system on their humungous new dataset, they say it produced new state-of-the-art results on several standard tests for how well software can interpret images, such as detecting objects in photos. There was a clear relationship between the volume of data they pumped in and the accuracy of image recognition algorithms that came out. The findings go some way to clear up a question circulating in the AI research world about whether more could be squeezed from existing algorithms just by giving them more data to feed on.
Showing that more data can equal more performance even at huge scale suggests that there could be even greater benefits to being a data-rich tech giant like Google, Facebook, or Microsoft than previously realized. Crunching Google’s giant dataset of 300 million images didn’t produce a huge benefit: jumping from 1 million to 300 million images increased the object detection score achieved by just 3 percentage points. But the paper’s authors say they think they can widen that advantage by tuning their software to be better suited to super-large datasets. Even if that turns out not to be the case, in the tech industry small advantages can be important. Every incremental gain in the accuracy of self-driving car vision will be crucial, for example, and a small efficiency boost to a product that draws billions in revenue adds up fast.
Data hoarding is already well established as a defensive strategy among AI-centric companies. Google, Microsoft and others have open-sourced lots of software, and even hardware designs, but are less free with the kind of data that makes such tools useful. Tech companies do release data: Last year, Google released a vast dataset drawn from more than 7 million YouTube videos, and Salesforce opened up one drawn from Wikipedia to help algorithms work with language. But Luke de Oliveira, a partner at AI development lab Manifold and a visiting researcher at Lawrence Berkeley National Lab, says that (as you might expect) such releases don’t usually offer much of value to potential competitors. “These are never datasets that are truly crucial for the continued market position of a product,” he says.
Google and CMU’s researchers do say they want their latest study on the value of what they dub “enormous data” to catalyze the creation of much larger, Google-scale, open image datasets. “Our sincere hope is that this inspires the vision community to not undervalue the data and develop collective efforts in building larger datasets,” they write. Abhinav Gupta of CMU, who worked on the study, says one option could be to work with the Common Visual Data Foundation, a nonprofit sponsored by Facebook and Microsoft that has released open image datasets.
Meanwhile, data-poor companies that want to survive in a world where the data rich can expect their algorithms to be smarter have to get creative. Jeremy Achin, CEO of startup DataRobot, guesses that a model seen in insurance, where smaller companies (carefully) pool data to make their risk predictions competitive with those of larger competitors, might catch on more broadly as machine learning becomes important to more companies and industries.
Progress on making machine learning less data hungry could upend the data economics of AI; Uber bought one company working on that last year. But right now it's also possible to try to sidestep the AI incumbents’ usual data advantage. Rachel Thomas, cofounder of an organization that works to make machine learning more accessible, says startups can find places to get rich applying machine learning outside the usual purview of internet giants, such as agriculture. “I’m not sure these large companies necessarily have a huge advantage everywhere, in a lot of these specific domains data just isn’t being collected at all by anyone,” she says. Even artificially intelligent giants have blind spots.