Amazon has a secret workaround to scrape Microsoft’s GitHub for AI model training data, leaked memo shows
To create powerful AI models, you need mountains of good data. Amazon is going to great lengths to collect this type of valuable information.
The company recently told employees to sign up for Microsoft’s GitHub software development platform and share their accounts so Amazon can scrape data from GitHub more quickly.
This is a key step in Amazon’s efforts to train its upcoming in-house AI model.
In an internal memo shared with employees last month, Amazon’s Artificial General Intelligence Group wrote that it needs “quantitative and qualitative metadata from GitHub” for AI training purposes.
But there’s a problem. A single GitHub account can only make 5,000 data-collection requests per hour. There are more than 150 million public data repositories on GitHub, so these account limitations mean scraping all this information would take too long, according to the memo.
To get around this, the Amazon AGI team is asking employees to create new GitHub accounts and share them with the company. Then, Amazon can run all these accounts simultaneously, reducing the time to collect data to just a “few weeks,” according to the memo.
“Fetching all of this with a single account would take many years,” the memo explained. “In order to increase the rate at which we can collect the metadata, we ask team members to create GitHub account[s] and share the API keys.”
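The memo's arithmetic can be roughly checked. The sketch below uses only the two figures reported above (5,000 requests per hour per account and about 150 million public repositories). The number of requests needed per repository is not stated in the memo, so `requests_per_repo` is an assumption, as is the three-week target used to stand in for a "few weeks." The function names are illustrative only.

```python
import math

REPOS = 150_000_000          # public repositories, per the memo
REQUESTS_PER_HOUR = 5_000    # per-account limit, per the memo


def hours_to_collect(repos, requests_per_repo, accounts=1,
                     limit_per_hour=REQUESTS_PER_HOUR):
    """Hours needed if every account runs at its full hourly limit."""
    total_requests = repos * requests_per_repo
    return total_requests / (limit_per_hour * accounts)


def accounts_needed(repos, requests_per_repo, target_hours,
                    limit_per_hour=REQUESTS_PER_HOUR):
    """Smallest number of accounts that finishes within target_hours."""
    total_requests = repos * requests_per_repo
    return math.ceil(total_requests / (limit_per_hour * target_hours))


if __name__ == "__main__":
    three_weeks = 3 * 7 * 24  # assumed reading of "a few weeks", in hours
    for per_repo in (1, 5):   # assumed values; the memo gives no figure
        years = hours_to_collect(REPOS, per_repo) / (24 * 365)
        needed = accounts_needed(REPOS, per_repo, three_weeks)
        print(f"{per_repo} request(s)/repo: {years:.1f} years on one account, "
              f"{needed} accounts for three weeks")
```

Even at a single request per repository, one account would need 30,000 hours, or about 3.4 years. Several requests per repository would push that into the "many years" the memo describes, while a few dozen to a few hundred accounts running in parallel would bring it down to weeks.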
Amazon’s leadership is openly soliciting employee help with this workaround.
Rohit Prasad, Amazon’s head scientist and SVP of the AGI group, encouraged employees to share their GitHub accounts to help “collect more high-quality code data for training our foundation models,” according to an internal email from late May, titled “Help with data.”
Another email from an Amazon AGI director urged employees: “It only takes 5 minutes!”
The episode highlights the rabid thirst for data among tech companies developing their own AI models. These models need lots of high-quality information to become more intelligent and human-like. There’s a finite supply of this information, which is leading to a “war for data” among tech companies.
In Amazon’s case, the company needs more data to train a yet-to-be-released new AI model, internally dubbed its “most ambitious” AI project. Launching a new, more powerful AI model is important for Amazon, as the company is trying to catch up to rivals Microsoft, Google, and Meta in the generative AI space.
Alleged license violations
While the GitHub workaround will most likely speed up Amazon’s AI training process, it could raise ethical concerns over accessing data without appropriate permissions.
Microsoft is likely to be unhappy when it discovers that its arch rival Amazon is leaning hard on GitHub for AI training data.
Even Microsoft itself is facing a lawsuit for allegedly violating license agreements when it used GitHub data to train its Copilot AI service.
“Amazon supports the protection of rightsholders and content creators, as well as established legal frameworks that facilitate the development of innovative and beneficial services,” Amazon said in a statement. “Our LLMs are trained on data from a variety of sources, including licensed and proprietary data, open-source datasets, and publicly available data where appropriate. While this is an evolving area, we adhere to industry best practices around data collection to train our models.”
The company also explained that it has created systems to “properly credit open-source developers if generated code suggestions are similar to their projects.”
‘Showing our hand’
In the internal memo, Amazon wrote that the GitHub workaround was approved by both the company’s legal and security teams.
By following the guidelines, Amazon stays within GitHub’s rate limits and avoids getting its accounts blocked, the memo said.
In terms of “showing our hand,” the memo added, Amazon’s move “should not alarm anyone” because the company is working on multiple products at the same time.
For employees interested in helping, the memo said they should use an Amazon work email, not a personal account, to sign up for GitHub.
It also said Amazon employees should create a “classic personal token,” not a “fine-grained personal token,” when signing up. GitHub classic personal tokens give access to a broader set of code repositories, though they may be less secure, according to GitHub’s website.
The Amazon instructions also said the expiration of these tokens should be set to one year, and no “scopes” should be selected to ensure the token has “read-only” access to public information.
Once they sign up, Amazon employees should copy and paste their personal access tokens into a shared company file, the memo added.
‘Most expansive’ models
For Amazon, more data is crucial for its new AI model. Last year, Amazon CEO Andy Jassy wrote in an internal email that Prasad would be leading the newly created AGI team, with the goal of building the “most expansive” large language models for the company. Prasad now reports directly to Jassy.
Amazon may be behind some of its AI competitors, which have spent years in a huge land grab for training data.
OpenAI, for example, has been striking a series of licensing deals with a long list of companies, including Reddit, Shutterstock, and News Corp, to use their content for AI model training. Tech companies, hungry for even more training data, are also granting themselves new permissions to use a lot more of consumers’ information.
Amazon’s AGI team, meanwhile, has already gone through a major restructuring. In November, it laid off some of the employees who were working on Alexa-related projects, as we previously reported. At the time, Prasad also outlined six new focus areas for the AGI group, including foundational models and conversational assistant services.
A tricky position?
Though Amazon’s legal team has approved the GitHub data scraping workaround, the move could put Amazon in a tricky position.
In 2022, programmer Matthew Butterick and the Joseph Saveri Law Firm filed a class action lawsuit against GitHub owner Microsoft, alleging open-source license violations. Microsoft trained its Copilot AI service on publicly available code from GitHub, without complying with the “underlying open-source licenses and other legal requirements,” according to the law firm’s website.
While open-source code on GitHub is generally free to use, it comes with certain obligations, such as preserving accurate attribution of the source code, Butterick wrote on the website about the lawsuit. For Copilot, it is nearly impossible to credit the original source since it’s built on billions of lines of code from GitHub, while Microsoft gets to sell it without giving back anything to the open-source community, he wrote.
“Like Neo plugged into the Matrix, or a cow on a farm, Copilot wants to convert us into nothing more than producers of a resource to be extracted (Well, until we can be disposed of entirely),” Butterick wrote. “And for what? Even the cows get food & shelter out of the deal. Copilot contributes nothing to our individual projects. And nothing to open source broadly.”