How TF … does Google even work?!

This week, Shelby explains how Google's search engine works. We look at crawling, indexing and ranking – and why understanding the process can help you level up your search strategy.

Mar 21, 2022

Happy Monday, friends! Shelby here, back behind the keyboard and hot off a weekend of doing the bare minimum. I’m trying this thing where I give myself permission to rest. Let me tell you, dear internet friends – it is not easy, but oh-so-necessary.

Today, we’ll be going through one of the essentials of SEO – how search engines work. We’ll go step-by-step through the process that search engines use to find, store and provide your journalism to the audience. The process is tricky, but understanding it can help you take your search strategy to the next level.

Let’s get it.

In this issue:

How do search engines work?
Crawling, indexing, ranking – what are they?
Why the search engine process matters

THE 101

How do search engines work?

Search engines have three main components: crawling, indexing and ranking. They crawl content across the internet, processing the information they find into an index that is then used to rank content for relevant queries.

For us in journalism, the process can look like this:

A story is published > search engine web crawler (also known as a spider) explores the content > spider stores the story in the index > a reader searches for something > stories are ranked for relevant queries.

Google uses web crawlers to explore the internet through links on a regular basis, finding new information to add to its index. The web crawlers are constantly exploring new corners of the internet; it can take from a few days to a month for a new site to be added to the index. According to Google, the majority of sites listed in their results aren’t manually submitted, but rather found by automatic crawls.

To manually submit your new site, follow the steps laid out by Ahrefs.

Google is still king. Not all search engines are made the same, and each has small, intricate nuances that make it unique. However, more than 90 per cent of all web searches happen on Google – nearly 20 times more than Bing and Yahoo! combined. For the sake of simplicity, we use “Google” and “search engines” interchangeably throughout this issue.

THE HOW TO

Explaining each part of the process

Crawling

Crawling is the process by which the search engine sends crawlers (also known as spiders) to discover new pieces of content on the internet. Spiders crawl a link, following it as far as they can go and examining all of the information on each page so it can eventually be included in the index – a massive library of all of the pages the search engine can see and wants to eventually surface to a reader.

The same process happens regardless of the type of content. Web crawlers find the video, image, PDF or article by scouring the internet, jumping from site to site via connecting links. Just as you read one story on one website and jump to another, so does Google.

The main way Google discovers new web pages is by following links from pages it already knows about. This is why internal linking and backlinking are so important, especially for new stories.
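To make the link-following idea concrete, here is a minimal Python sketch of a crawler working through a tiny, made-up site. The URLs and the LINKS map are hypothetical, and this is a toy model rather than how Googlebot is actually built. It only shows the principle: a page that nothing links to never gets discovered.

```python
from collections import deque

# Hypothetical mini-site: each URL maps to the URLs it links to.
LINKS = {
    "/": ["/politics", "/sports"],
    "/politics": ["/politics/election-story"],
    "/sports": ["/"],
    "/politics/election-story": [],
    "/orphan-story": [],  # published, but no other page links to it
}

def crawl(start, links):
    """Breadth-first crawl: follow links outward from pages already known."""
    discovered = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                discovered.append(target)
                queue.append(target)
    return discovered

found = crawl("/", LINKS)
print(found)
print("/orphan-story" in found)
```

The election story is found because the politics page links to it. The orphan story is not, which is the case for linking to new stories from pages the crawler already visits.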

What is a crawl budget?

A crawl budget is the number of URLs Google can reasonably go through on your site in one instance of crawling.

Because a search engine can’t possibly crawl every page on your site each time it visits, there needs to be a limit on how many URLs a spider can crawl in any given time frame. This limit is based on the size and health of your site, as well as the number of links to your site.

If you exceed your crawl budget – for example, by publishing too many pages at once – then some pages won’t be crawled and therefore cannot be indexed in Google’s library.

Pro tip: To quickly estimate your site’s crawl budget, look at your site’s Crawl Stats report in Google Search Console and divide the total number of crawl requests by the number of days covered by the report (you can also download the data as a CSV to use in Excel or Google Sheets).
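If you would rather script that division than do it by hand, a small Python sketch might look like this. The function name and the figures are made up for illustration; plug in the totals from your own report.

```python
def average_daily_crawls(total_crawl_requests, days_in_report):
    """Divide total crawl requests by the number of days the report covers."""
    if days_in_report <= 0:
        raise ValueError("days_in_report must be positive")
    return total_crawl_requests / days_in_report

# Made-up figures for illustration: 45,000 crawl requests over 90 days.
daily_average = average_daily_crawls(45_000, 90)
print(daily_average)  # 500.0
```

In this invented example, the site is crawled about 500 times a day on average, which gives you a rough baseline to compare against how many URLs you publish or update daily.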

For news sites, this can be a problem. We publish so many stories on a regular basis that it can become overwhelming for spiders to know which are most important. We must maximize our crawl budget.

Sites with fewer than a million pages will rarely hit their crawl limit, so you may not need to worry about this for a while. But the rate at which you publish stories can slow down your servers and therefore slow down how fast spiders can crawl your site.

How to maximize your crawl budget

Google has become much more granular with its crawling habits. In the past, it could take hours for a web crawler to return to your site and find new articles. Now, this can happen within a matter of minutes, and you can request/force Google to crawl and index a story through Google Search Console’s URL Inspection Tool (be careful: there are rules around using this).

Here are a few other tips for better crawling of your site:

Reduce errors on site: If you have pages that return a 4xx error status code (such as a 404) or redirect in circles, fix them. Search engines waste time trying to crawl these pages when they could be finding important ones.
Create multiple sitemaps: This allows multiple ways for Google to access the information on your site and crawl it based on the information provided.
Have a solid site structure: A crawler should be able to go from your homepage anywhere on your site with ease.
Block irrelevant parts of your site: There’s probably a login page, an old set of scanned articles, JavaScript code or a page lying around that you don’t want readers to find through search. Block these using robots.txt so Google doesn’t waste time crawling pages that don’t matter to your audience.
Build good backlinks: There’s no easier way to signal to Google that a piece is worth crawling than having someone else link to it.
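As a rough illustration of the robots.txt tip, here is a Python sketch that uses the standard library’s robots.txt parser to check which paths a crawler may fetch. The rules, paths and example.com domain are all hypothetical; your own file will look different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a news site; the paths are examples only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /archive/scans/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_crawlable(path, user_agent="Googlebot"):
    """Return True if the rules above let this user agent fetch the path."""
    return parser.can_fetch(user_agent, "https://www.example.com" + path)

print(is_crawlable("/news/city-council-vote"))
print(is_crawlable("/login"))
print(is_crawlable("/archive/scans/1987-03-21.pdf"))
```

With these example rules, the story is open to crawlers while the login page and the scanned archive are not. Running a check like this before deploying a robots.txt change is a cheap way to make sure you have not blocked your stories by accident.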

Indexing

Indexing is the process by which Google stores information about a page in a huge database, or library, including the headline, body copy, whether it is accessible for free and when it was published.

This index is massive. Billions of web pages are stored in the index, and it is constantly being updated. If we change something in a story, the story must be re-indexed, or the search engine can’t provide the reader with the most recent version of the page.
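Here is a toy Python sketch of why a changed story has to be re-indexed. TinyIndex is a hypothetical, drastically simplified stand-in for a search index (it maps words to the URLs that contain them), not a description of Google’s real system.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: maps each word to the set of URLs containing it."""

    def __init__(self):
        self.pages = {}                    # url -> latest stored text
        self.postings = defaultdict(set)   # word -> urls containing it

    def index(self, url, text):
        """Store a page, replacing any earlier version of the same URL."""
        self.remove(url)
        self.pages[url] = text
        for word in text.lower().split():
            self.postings[word].add(url)

    def remove(self, url):
        """Forget the stored version of a URL, if there is one."""
        old_text = self.pages.pop(url, "")
        for word in old_text.lower().split():
            self.postings[word].discard(url)

    def lookup(self, word):
        """Return the URLs whose stored version contains the word."""
        return sorted(self.postings.get(word.lower(), set()))

idx = TinyIndex()
idx.index("/story", "council delays budget vote")
before = idx.lookup("delays")

# The story is updated; until it is re-indexed, the old wording is what is stored.
idx.index("/story", "council approves budget")
after_old_word = idx.lookup("delays")
after_new_word = idx.lookup("approves")
print(before, after_old_word, after_new_word)
```

Only after the second call to index does a search for the new wording find the story. Until then, the index can only serve the version it already has.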

You can check if a story is indexed by running it through the URL Inspection Tool in GSC, or filtering through Google’s results.

Pro tip: If you search “site:yourwebsite.com,” Google will provide you with all of the links within its index as the results. Review these results to see which URLs show up and whether they should be indexed.
Pro tip #2: To see how a Googlebot sees your page (and how it’s seen in the index), add “cache:” before the URL in the address bar, or click the button beside the URL in search results and then click “Cached.”

Note: A search engine can crawl a piece of content and, for a variety of reasons, not index it. This can be done on purpose – you can add a noindex directive (a robots meta tag) that instructs Google not to surface a page in search results, or you can set a canonical URL pointing to a different page, which tells the search engine there’s a main version among many duplicates. If this is not on purpose, check Google Search Console for any warnings or errors you may need to fix.
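If you want to check a page for these signals yourself, here is a minimal Python sketch. It assumes the “don’t surface this” instruction is expressed as a robots meta tag containing noindex and that the canonical URL sits in a link tag in the page’s head. The HTML snippet and the IndexingHints class are hypothetical examples.

```python
from html.parser import HTMLParser

class IndexingHints(HTMLParser):
    """Collects the robots meta tag and canonical link from a page's HTML."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            self.noindex = self.noindex or "noindex" in content
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")

# Hypothetical page markup for illustration.
HTML = """
<html><head>
<meta name="robots" content="noindex, follow">
<link rel="canonical" href="https://www.example.com/news/main-story">
</head><body>Story text</body></html>
"""

hints = IndexingHints()
hints.feed(HTML)
print(hints.noindex)
print(hints.canonical)
```

If noindex comes back True on a story you expect to rank, or the canonical points somewhere surprising, that is a good starting point for your Search Console investigation.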

Ranking

Once your site is indexed, search engines can show it in search results. Where the page shows up is determined by Google’s rankings – which are dictated by its ever-changing algorithm.

Google’s algorithm is made up of a complex series of ranking factors, weighted differently to eventually come to a list of links that should – in theory – best answer a reader’s query.
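Google’s real factors and weights are not public, so the Python sketch below is purely illustrative. The three factor names, the weights and the scores are all invented. It only shows the general idea of combining weighted factors into one score and sorting by it.

```python
# Hypothetical factors and weights, chosen only for illustration.
WEIGHTS = {"relevance": 0.5, "freshness": 0.3, "links": 0.2}

# Made-up factor scores between 0 and 1 for three indexed stories.
PAGES = {
    "/story-a": {"relevance": 0.9, "freshness": 0.2, "links": 0.5},
    "/story-b": {"relevance": 0.6, "freshness": 0.9, "links": 0.4},
    "/story-c": {"relevance": 0.3, "freshness": 0.5, "links": 0.9},
}

def score(factors, weights):
    """Weighted sum of a page's factor scores."""
    return sum(weights[name] * factors[name] for name in weights)

def rank(pages, weights):
    """Return URLs ordered from highest to lowest weighted score."""
    return sorted(pages, key=lambda url: score(pages[url], weights), reverse=True)

current = rank(PAGES, WEIGHTS)

# An "algorithm update" in this toy model is just a change to the weights.
UPDATED_WEIGHTS = {"relevance": 0.8, "freshness": 0.1, "links": 0.1}
updated = rank(PAGES, UPDATED_WEIGHTS)

print(current)
print(updated)
```

With the first set of weights, the fresher story-b edges out story-a. Once relevance is weighted more heavily, story-a moves to the top even though nothing about the pages themselves has changed.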

We know that what is relevant to people in North Carolina, U.S.A., is not relevant to people in Brisbane, Australia. Search engines know that, too, and try to provide search engine results pages (SERPs) full of information that is relevant based on location, device type, demographic, language and many other factors.

This is also where we see fluctuations in SERPs. As the algorithm is updated, Google will scan its index to determine which pages, as they were at the time of being indexed, best answer a reader’s query. If an update is made to the algorithm and the indexed version no longer answers the query, the page may drop in rankings or be removed entirely (Moz has a list of all Google algorithm updates dating back to 2000, not including 2022).

Focus on best practices for on-page optimization, structured data and site quality to ensure you rank for the queries relevant to your news site.

Why the search engine process matters

Knowing how search engines work can be extremely beneficial. We can use that knowledge of the process to our advantage.

News is about speed. We want to be the first to break the story, the first with the scoop and the first to be credited. Google works the same way – the faster you get the search engine to crawl your story and index it, the faster people can read it.

Understanding the search engine process gives you a few benefits:

You understand that Google crawls the internet by following a series of links and can focus on a strong backlink strategy.
Search engines have an easier time finding new stories when they are linked to from your homepage, so you can ensure your internal linking strategy is up to snuff.
Google indexes stories around topical authority and relevance, so you can focus on your niche subjects.

And most importantly…

The Google algorithm is always changing. We should focus on the best ways to optimize our articles for readers first.

The bottom line: The search engine process is complex, but follows the same premise – find, store and then provide your content to readers in the best way possible. Our job is to ensure we are providing the most accurate information to search engines so we can provide the best information to our readers.

WTF is SEO?
