Has Google’s Search Algorithm Been Exposed?

This is Erfan Azimi, an SEO guru and owner of EA Eagle Digital. He says that some ex-Googlers have exposed part of the inner workings of the Google Search mystery box. Here are some of the main points from Rand Fishkin’s article on SparkToro, quoted in Fishkin’s own words.

  • In their early years, Google’s search team recognized a need for full clickstream data (every URL visited by a browser) for a large percent of web users to improve their search engine’s result quality.
  • A system called “NavBoost” (cited by VP of Search, Pandu Nayak, in his DOJ case testimony) initially gathered data from Google’s Toolbar PageRank, and desire for more clickstream data served as the key motivation for creation of the Chrome browser (launched in 2008).
  • NavBoost uses the number of searches for a given keyword to identify trending search demand, the number of clicks on a search result (I ran several experiments on this from 2013-2015), and long clicks versus short clicks (which I presented theories about in this 2015 video).
  • Google utilizes cookie history, logged-in Chrome data, and pattern detection (referred to in the leak as “unsquashed” clicks versus “squashed” clicks) as effective means for fighting manual & automated click spam.
  • NavBoost also scores queries for user intent. For example, certain thresholds of attention and clicks on videos or images will trigger video or image features for that query and related, NavBoost-associated queries.
  • Google examines clicks and engagement on searches both during and after the main query (referred to as a “NavBoost query”). For instance, if many users search for “Rand Fishkin,” don’t find SparkToro, and immediately change their query to “SparkToro” and click SparkToro.com in the search result, SparkToro.com (and websites mentioning “SparkToro”) will receive a boost in the search results for the “Rand Fishkin” keyword.
  • NavBoost’s data is used at the host level for evaluating a site’s overall quality (my anonymous source speculated that this could be what Google and SEOs called “Panda”). This evaluation can result in a boost or a demotion.
  • Other minor factors such as penalties for domain names that exactly match unbranded search queries (e.g. mens-luxury-watches.com or milwaukee-homes-for-sale.net), a newer “BabyPanda” score, and spam signals are also considered during the quality evaluation process.
  • NavBoost geo-fences click data, taking into account country and state/province levels, as well as mobile versus desktop usage. However, if Google lacks data for certain regions or user-agents, they may apply the process universally to the query results.
  • During the Covid-19 pandemic, Google employed whitelists for websites that could appear high in the results for Covid-related searches.
  • Similarly, during democratic elections, Google employed whitelists for sites that should be shown (or demoted) for election-related information.
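
To make the geo-fencing point above more concrete, here is a minimal Python sketch of how a system could fall back from narrow click data to broader data when a region or device has too few clicks. It is purely illustrative: the function name `pick_click_segment`, the `MIN_CLICKS` threshold, and the data layout are assumptions made for this example, and nothing here is taken from the leaked documents.

```python
from typing import Optional

# Hypothetical layout: click counts keyed by (country, region, device).
# None in a slot means "aggregated across that dimension".
Segment = tuple[Optional[str], Optional[str], Optional[str]]

MIN_CLICKS = 100  # assumed cutoff for "enough data"; not a value from the leak


def pick_click_segment(
    clicks: dict[Segment, int], country: str, region: str, device: str
) -> Segment:
    """Return the most specific segment that has at least MIN_CLICKS clicks."""
    candidates: list[Segment] = [
        (country, region, device),  # e.g. US / WA / mobile
        (country, region, None),    # all devices in the state/province
        (country, None, None),      # the whole country
        (None, None, None),         # universal (all users)
    ]
    for segment in candidates:
        if clicks.get(segment, 0) >= MIN_CLICKS:
            return segment
    # Nothing has enough data: fall back to the universal segment anyway.
    return (None, None, None)
```

With made-up counts where only the country-level and universal segments exceed the threshold, a mobile searcher in Washington state would be served from the country-level data, while a searcher in a country with no data at all would get the universal results.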

And these are only the tip of the iceberg.

“Extraordinary claims require extraordinary evidence. And while some of these overlap with information revealed during the Google/DOJ case (some of which you can read about on this thread from 2020), many are novel and suggest insider knowledge.”

Fishkin describes how he verified the leak’s authenticity:

A critical next step in the process was verifying the authenticity of the API Content Warehouse documents. So, I reached out to some ex-Googler friends, shared the leaked docs, and asked for their thoughts. Three ex-Googlers wrote back: one said they didn’t feel comfortable looking at or commenting on it. The other two shared the following (off the record and anonymously):

  • “I didn’t have access to this code when I worked there. But this certainly looks legit.”
  • “It has all the hallmarks of an internal Google API.”
  • “It’s a Java-based API. And someone spent a lot of time adhering to Google’s own internal standards for documentation and naming.”
  • “I’d need more time to be sure, but this matches internal documentation I’m familiar with.”
  • “Nothing I saw in a brief review suggests this is anything but legit.”

Next, I needed help analyzing and deciphering the naming conventions and more technical aspects of the documentation. I’ve worked with APIs a bit, but it’s been 20 years since I wrote code and 6 years since I practiced SEO professionally. So, I reached out to one of the world’s foremost technical SEOs: Mike King, founder of iPullRank.

During a 40-minute phone call on Friday afternoon, Mike reviewed the leak and confirmed my suspicions: this appears to be a legitimate set of documents from inside Google’s Search division, and contains an extraordinary amount of previously-unconfirmed information about Google’s inner workings.

2,500 technical documents is an unreasonable amount of material to ask one man (a dad, husband, and entrepreneur, no less) to review in a single weekend. But, that didn’t stop Mike from doing his best.
He’s put together an exceptionally detailed initial review of the Google API leak here, which I’ll reference more in the findings below. And he’s also agreed to join us at SparkTogether 2024 in Seattle, WA on Oct. 8, where he’ll present the fully transparent story of this leak in far greater detail, and with the benefit of the next few months of analysis.
