Ecommerce SEO

CHAPTER 4:

Crawl Optimization

Length: 6,918 words

Estimated reading time: 50 minutes

This guide has almost 400 pages of advanced, actionable insights into on-page SEO for ecommerce.

Written by an ecommerce SEO consultant with 20 years of research and practical experience, this exhaustive SEO resource will teach you how to identify and address all of the SEO issues specific to ecommerce websites, in one place.

The strategies and tactics described in this guide have been successfully implemented on Top 10 online retailers, small & medium businesses, and mom and pop stores.

Crawl optimization is aimed at helping search engines discover URLs in the most efficient manner. Relevant pages should be easy to reach, while less important pages should not waste the so-called “crawl budget” and should not create crawl traps. Crawl budget is defined as the number of URLs search engines can and want to crawl.

Search engines assign a crawl budget to each website, depending on the authority of the website. Generally, the authority of a site is roughly proportional to its PageRank.

The concept of crawl budget is essential for ecommerce websites because they usually comprise a vast number of URLs, from tens of thousands to millions.

If the technical architecture puts the search engine crawlers (also known as robots, bots or spiders) in infinite loops or traps, the crawl budget will be wasted on pages that are not important for users or search engines. This waste may lead to important pages being left out of search engines’ indices.

Additionally, crawl optimization is where very large websites can take advantage of the opportunity to have more critical pages indexed and low PageRank pages crawled more frequently.[1]

The number of URLs Google can index increased dramatically after the introduction of their Percolator[2] architecture, with the “Caffeine” update.[3] However, it is still important to check what resources search engine bots request on your website and to prioritize crawling accordingly.

Before we begin, it is important to understand that crawling and indexing are two different processes. Crawling means just fetching files from websites. Indexing means analyzing the files and deciding whether they are worthy of inclusion. So, even if search engines crawl a page, they will not necessarily index it.

Crawling is influenced by several factors such as the website’s structure, internal linking, domain authority, URL accessibility, content freshness, update frequency, and the crawl rate settings in webmaster tools accounts.
Before detailing these factors, let’s talk about tracking and monitoring search engine bots.

Tracking and monitoring bots

Googlebot, Yahoo! Slurp, and Bingbot are polite bots,[4] which means that they will first obey the crawling directives found in robots.txt files, before requesting resources from your website. Polite bots will identify themselves to the web server, so you can control them as you wish. The requests made by bots are stored in your log files and are available for analysis.

Webmaster tools, such as the ones provided by Google and Bing, only uncover a small part of what bots do on your website—e.g., how many pages they crawl or bandwidth usage data. That is useful in some ways but is not enough.

For really useful insights, you have to analyze the traffic log files. From these, you will be able to extract information that can help identify large-scale issues.

Traditionally, log file analysis was performed using the grep command line with regular expressions. But, lately, there are also desktop and web-based solutions that will make this type of geek analysis easier and more accessible to marketers.

On ecommerce websites, monthly log files are usually huge—gigabytes or even terabytes of data. However, you do not need all the data inside the log files to be able to track and monitor search engine bots. You need just the lines generated by bot requests. This way you can significantly reduce the size of the log files from gigabytes to megabytes.

Using the following Linux command line (case sensitive) will extract just the lines containing “Googlebot”, from one log file (access_log.processed) to another (googlebot.log):
grep "Googlebot" access_log.processed > googlebot.log

To extract similar data for Bing and other search engines, replace “Googlebot” with other bot names.

Figure 86 – The log file was reduced from 162.5Mb to 1.4Mb.

Open the bot-specific log file with Excel, go to Data –> Text to Columns, and use Delimited with Space to enter the log file data into a table format like this one:

Figure 87 – The data is filtered by Status, to get a list of all 404 Not Found errors encountered by Googlebot.

Note: you can import only up to one million rows in Excel; if you need to import more, use MS Access or Notepad++.

To quickly identify crawling issues at category page levels, chart the Googlebot hits for each category. This is where the advantage of category-based navigation and URL structure comes in handy.

Figure 88 – It looks like the /bracelets/ directory needs some investigation because there are too few bot requests compared to the other directories.
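If you prefer scripting to spreadsheets, the per-category chart can be approximated with a short Python sketch. It assumes your server writes the common Apache/Nginx "combined" log format and that the first path segment of a URL is the category directory; both are assumptions, so adjust the pattern to your own logs.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the request appears as
# "GET /path HTTP/1.1" followed by the three-digit status code.
REQUEST = re.compile(r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def hits_per_directory(lines, bot="Googlebot"):
    """Count bot requests per top-level directory (e.g. /bracelets/)."""
    counts = Counter()
    for line in lines:
        if bot not in line:  # same case-sensitive filter as the grep command
            continue
        match = REQUEST.search(line)
        if not match:
            continue
        path = match.group("path").split("?", 1)[0]  # drop the query string
        parts = path.split("/")
        # "/bracelets/gold-bangle" -> "/bracelets/"; "/about" or "/" -> "/"
        top = "/" + parts[1] + "/" if len(parts) > 2 else "/"
        counts[top] += 1
    return counts
```

Passing an open file works because the function only iterates over lines, for example hits_per_directory(open("googlebot.log")). Calling most_common() on the result lists the directories from most to least crawled, which makes under-crawled categories easy to spot.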

By pivoting the log file data by URLs and crawl date, you can identify content that gets crawled less often:

Figure 89 – The dates the URLs have been fetched.

In this pivot table, you can see that although the three URLs are positioned at the same level in the hierarchy, URL number three gets crawled much more often than the other two. This is a sign that URL #3 is deemed more important.

Figure 90 – More external backlinks and social media mentions may result in an increased crawl frequency.
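The same pivot can be sketched in Python. The example below again assumes the Apache/Nginx "combined" log format with timestamps such as [10/Oct/2023:13:55:36 +0000] and English month abbreviations; it is an illustration of the idea, not a full log analyzer.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: timestamps look like [10/Oct/2023:13:55:36 +0000]
# and are followed by the quoted request line.
ENTRY = re.compile(
    r'\[(?P<day>\d{2}/[A-Za-z]{3}/\d{4}):[^\]]*\] "(?:GET|HEAD) (?P<path>\S+) '
)


def crawl_dates(lines, bot="Googlebot"):
    """Map each URL path to the sorted list of distinct dates the bot fetched it."""
    dates = defaultdict(set)
    for line in lines:
        if bot not in line:
            continue
        match = ENTRY.search(line)
        if match:
            day = datetime.strptime(match.group("day"), "%d/%b/%Y").date()
            dates[match.group("path")].add(day)
    return {path: sorted(days) for path, days in dates.items()}


def days_since_last_crawl(dates_by_path, today):
    """Number of days between `today` and the most recent fetch of each URL."""
    return {path: (today - days[-1]).days for path, days in dates_by_path.items()}
```

URLs with few dates in the first result are crawled less often, and the second function gives the "days since the last crawl" figure that you can compare against rankings.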

Here are some issues and ideas you should consider when analyzing bot behavior using log files:

  • Analyze server response errors and identify what generates those errors.
  • Discover unnecessarily crawled pages and crawling traps.
  • Correlate days since the last crawl with rankings; when you make changes to a page, make sure it gets re-crawled; otherwise, the updates will not be considered for rankings.
  • Discover whether products listed at the top of listings are crawled more often than products listed on component pages (paginated listings). Consider moving the most important products to the first page, rather than having them on component pages.
  • Check the frequency and depth of the crawl.

The goal of tracking bots is to:

  • Establish where the crawl budget is used.
  • Identify unnecessary requests (e.g., “Write a Review” links that open pages with exactly the same content except for the product name, such as mysite.com/review.php?pid=1, mysite.com/review.php?pid=2 and so on).
  • Fix the leaks.

Instead of wasting budget on unwanted URLs (e.g., duplicate content URLs), focus on sending crawlers to pages that matter for you and your users.

Another useful application of log files is to evaluate the quality of backlinks. Rent links from various external websites and point them at pages with no other backlinks (product detail pages or pages that support product detail pages). Then, analyze the spider activity on those pages. If the crawl frequency increases, then that link is more valuable than a link that does not increase spider activity at all. An increase in crawling frequency on your pages suggests that the page you got the link from is also crawled often, which means that the linking page has good authority. Once you have identified good opportunities, work to get natural links from those websites.

Flat website structure

If there are no other technical impediments to crawling large websites (e.g., crawlable facets or infinite spaces[5]), a flat website architecture can help crawling by allowing search engines to reach deep pages in very few hops, therefore using the crawl budget very efficiently.

Pagination—or, to be more specific, de-pagination—is one way to flatten your website architecture. We will discuss pagination later, in the Listing Pages section.

For more information on flat website architecture, please refer to the section titled The Concept of Flat Architecture in the Site Architecture section.

Accessibility

I will refer to accessibility in terms of optimization for search engines rather than optimization for users.

Accessibility is probably a critical factor for crawling. Your crawl budget is dictated by how the server responds to bot traffic. If the technical architecture of your website makes it impossible for search engine bots to access URLs, then those URLs will not be indexed. URLs that are already indexed but are not accessible after a few unsuccessful attempts may be removed from search engine indices.
Google crawls new websites at a low rate, then gradually increases the rate up to a level that does not create accessibility issues for your users or your server.

So, what prevents URLs and content from being accessible?

DNS and connectivity issues
Use http://www.intodns.com/ to check for DNS issues. Everything that comes in red and yellow needs your attention (even if it is an MX record).

Figure 91 – Report from intodns.com.

Using Google and Bing webmaster accounts, fix all the issues related to DNS and connectivity:

Figure 92 – Bing’s Crawl Information report.

Figure 93 – Google’s Site Errors report in the old GSC.[6]

One DNS issue you may want to pay attention to is related to wildcard DNS records, which means the web server responds with a 200 OK code for any subdomain request, even for ones that do not exist. An even more severe problem related to DNS is unrecognizable hostnames, which means the DNS lookup fails when trying to resolve the domain name.
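A quick way to probe for a wildcard DNS record is to look up a subdomain that almost certainly does not exist and see whether it resolves. The Python sketch below does only that; the function name and the injectable resolve parameter are illustrative choices, and a resolving hostname is only the precondition for the 200 OK problem described above, so you would still need to request the URL to confirm the server response.

```python
import socket
import uuid


def has_wildcard_dns(domain, resolve=socket.gethostbyname):
    """Return True if a random, almost certainly nonexistent subdomain resolves."""
    probe = f"{uuid.uuid4().hex}.{domain}"  # e.g. 3f2a...c91.example.com
    try:
        resolve(probe)
    except OSError:  # socket.gaierror (lookup failure) is a subclass of OSError
        return False
    return True
```

Calling has_wildcard_dns("mysite.com") performs a real DNS lookup. The resolve parameter exists only so the check can be exercised without network access.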

One large retailer had another misconfiguration. Two of its country-specific domains, the US site (.com) and the UK site (.co.uk), resolved to the same IP. If you have multiple country-specific domains, host them on different IPs (ideally from within the country you target with each domain), and check how the domain names resolve.

Needless to say, if your web servers are down, no one will be able to access the website (including search engine bots). You can keep an eye on the availability of your site using server monitoring tools like Monitor.Us, Scoutt or Site24x7.

Host load
Host load represents the maximum number of simultaneous connections a web server can handle. Every page load request from Googlebot, Yahoo! Slurp, or Bingbot generates a connection with your web server. Since search engines use distributed crawling from multiple machines at the same time, you can theoretically reach the limits of the connections, and your website will crash (especially if you are on a shared hosting plan).

Use tools such as the one found at loadimpact.com to check how many connections your website can handle. Be careful though; your site can become unavailable or even crash during this test.

Figure 94 – If your website loads under two seconds when used by a large number of visitors, you should be fine – graph generated by loadimpact.com.

Page load time
Page load time is not only a crawling factor but also a ranking and usability factor. Amazon reportedly increased its revenue by 1% for every 100ms of load time improvement,[7] and Shopzilla increased revenue by 7% to 12% by decreasing the page load time by five seconds.[8]

There are plenty of articles about page load speed optimization, and they can get pretty technical. Here are a few pointers to summarize how you can optimize load times:

  • Defer loading of images until needed for display in the browser.
  • Use CSS sprites.
  • Use the HTTP/2 protocol.

Figure 95 – Amazon uses CSS sprites to minimize the number of requests to their server.

Figure 96 – Apple used sprites for their main navigation.

  • Use content delivery networks for media files and other files that do not update too often.
  • Implement database and cache (server-side caching) optimization.
  • Enable HTTP compression and implement conditional GET.
  • Optimize images.
  • Use expires headers.[9]
  • Ensure fast and responsive design to decrease the time to first byte (TTFB). Use http://webpagetest.org/ to measure TTFB. There seems to be a clear correlation between lower rankings and increased TTFB.[10]

If your URLs load slowly, search engines may interpret this as a connectivity issue and give up crawling the troubled URLs.

The time Google spends downloading a page seems to influence the number of pages it crawls. The less time it takes to download a page, the more pages are crawled.

Figure 97 – The correlation between the time spent downloading a page and the pages crawled per day seems apparent in this graph.

Broken links
This is a no-brainer. When your internal links are broken, crawlers will not be able to find the correct pages. Run a full crawl on the entire website with the crawling tool of your choice and fix all broken URLs. Also, use the webmaster tools provided by search engines to find broken URLs.

HTTP caching with Last-Modified/If-Modified-Since and ETag headers
In reference to crawling optimization, the term “cache” refers to a stored page in a search engine index. Note that caching is a highly technical issue, and improper caching settings may make search engines crawl and index a website chaotically.

When a search engine requests a resource on your website, it first asks your web server for the status of that resource. The server will reply with a header response. Based on the header response, search engines will decide to download the resource or to skip it.

Many search engines check whether the resource they request has changed since they last crawled it. If it has, they will fetch it again—if not, they will skip it. This mechanism is referred to as conditional GET. Bing confirmed that it uses the If-Modified-Since header,[11] and Google does as well.[12]

Below is the header response for a newly discovered page that supports the If-Modified-Since header when a request is made to access it.

Figure 98 – Use the curl command to get the last modified date.

When the bot requests the same URL the next time, it will add an If-Modified-Since header to the request. If the document has not been modified, the server will respond with a 304 status code (Not Modified):

Figure 99 – A 304 response header

In short, the server returns 304 Not Modified if the page has not changed since the date sent in the If-Modified-Since header. If it has been modified, the header response will be 200 OK, and the search engine will fetch the page again.
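On the server side, the decision can be sketched in a few lines of Python. This is a simplified illustration rather than how any particular platform implements it: the function name is hypothetical, and it assumes you can look up when a page last changed as a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at `last_modified` (UTC)."""
    last_modified = last_modified.replace(microsecond=0)  # HTTP dates use whole seconds
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
            if last_modified <= since:
                return 304, headers  # unchanged: the body does not need to be sent
        except (TypeError, ValueError):
            pass  # unparseable or timezone-less header: send the full page
    return 200, headers


# Example: a page last changed on 15 January 2024 at 10:00 UTC.
changed = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
print(conditional_get(changed, "Mon, 15 Jan 2024 10:00:00 GMT")[0])  # 304
```

A first-time request carries no If-Modified-Since header, so it receives 200 OK together with a Last-Modified header that the bot can send back on its next visit.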

The ETag header works similarly but is more complicated to handle.

If your ecommerce platform uses personalization, or if the content on each page changes frequently, it may be more challenging to implement HTTP caching, but even dynamic pages can support If-Modified-Since.[13]

Sitemaps

There are two major types of sitemaps:

  • HTML sitemaps
  • XML Sitemaps

You can also submit Sitemaps in the following formats: plain text files, RSS, or mRSS.
If you experience crawling and indexing issues, keep in mind that sitemaps are just a patch for more severe problems such as duplicate content, thin content or improper internal linking. Creating sitemaps is a good idea, but it will not fix those issues.

HTML sitemaps

HTML sitemaps are a form of secondary navigation. They are usually accessible to people and bots through a link placed at the bottom of the website, in the footer.

A usability study on a mix of websites, including ecommerce websites, found that people rarely use HTML sitemaps. In 2008, only 7% of the users turned to the sitemap when asked to learn about a site’s structure,[14] down from 27% in 2002. Nowadays, the percentage is probably even less.

Still, HTML sitemaps are handy for sending crawlers to pages at the lower levels of the website taxonomy and for creating flat internal linking.

Figure 100 – Sample flat architecture.

Here are some optimization tips for HTML sitemaps:

Use segmented sitemaps
When optimizing HTML sitemaps for crawling, it is important to remember that PageRank is divided between all the links on a page. Splitting the HTML sitemap into multiple smaller parts is a good way to create more user and search engine friendly pages for large websites, such as ecommerce websites.

Instead of a huge sitemap page that links to almost every page on your website, create a main sitemap index page (e.g., sitemap.html) and link from it to smaller sitemap component pages (sitemap-1.html, sitemap-2.html, etc.).

You can split the HTML sitemaps based on topics, categories, departments, or brands. Start by listing your top categories on the index page. The way you split the pages depends on the number of categories, subcategories, and products in your catalog. You can use the “100 links per page” rule below as a guideline, but do not get stuck on this number, especially if your website has good authority.

If you have more than 100 top-level categories, you should display the first 100 of them on the sitemap index page and the rest on additional sitemap pages. You can allow users and search engines to navigate the sitemap using previous and next links (e.g., “see more categories”).

If you have fewer than 100 top-level categories in the catalog, you will have room to list several important subcategories as well, as depicted below:

Figure 101 – A clean HTML sitemap example.

The top-level categories in this sitemap are Photography, Computers & Solutions and Pro Audio. Since this business has a limited number of top-level categories, there is room for several subcategories (Digital Cameras, Laptops, Recording).
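The splitting itself is simple to automate. The sketch below shows one possible way to spread a list of category links over an index page and numbered component pages; the function name is illustrative, the file names follow the sitemap.html and sitemap-1.html pattern mentioned earlier, and the limit of 100 is only the guideline discussed above.

```python
def paginate_sitemap(links, per_page=100):
    """Split links into sitemap.html, sitemap-1.html, sitemap-2.html, and so on."""
    pages = {}
    for start in range(0, len(links), per_page):
        index = start // per_page
        name = "sitemap.html" if index == 0 else f"sitemap-{index}.html"
        pages[name] = links[start:start + per_page]
    return pages
```

Each key is a page to render, and each value is the list of links that page should contain. The previous and next links between pages follow from the order of the file names.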

Do not link to redirects
The URLs linked from sitemap pages should land crawlers on the final URLs, rather than going through URL redirects.

Enrich the sitemaps
Adding a bit of extra data by annotating links with info is good for users and can provide some context for search engines as well. You can add data such as product thumbnails, customer ratings, manufacturer names, and so on.

These are just some suggestions for HTML sitemaps so that you can make the pages easier for people to read and very lightly linked for crawlers. However, the best way to help search engines discover content on your website is to feed them a list of URLs in different file formats. One such file format is XML.

XML Sitemaps

Modern ecommerce platforms should auto-generate XML Sitemaps, but many times the default output file is not optimized for crawling and analysis. It is therefore important to manually review and optimize the automated output or generate the Sitemaps based on your own rules.

Unless you have concerns about competitors spying on your URL structure, it is preferable to include the path of the XML Sitemap file within the robots.txt file.

Robots.txt is requested by search engines every time they start a new crawling session on your website. It is analyzed to see if it was modified since the last crawl. If it wasn’t modified, then search engines will use the existing robots.txt cached file to determine which URLs can be crawled.

If you do not specify the location of your XML Sitemap inside robots.txt, then search engines will not know where to find it (unless you submitted it within the webmaster accounts). Submitting to Google Search Console or Bing Webmaster Tools allows access to more insights, such as how many URLs have been submitted, how many are indexed, and what errors, if any, are present in the Sitemap.

Figure 102 – If you have an almost 100% indexation rate you probably do not need to worry about crawl optimization.
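To check whether a robots.txt file declares a Sitemap, you can read the file with Python's standard urllib.robotparser module (the site_maps() method requires Python 3.8 or later). The sketch below parses robots.txt content passed in as a string; the example.com URL in the note that follows is a placeholder.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt):
    """Return the Sitemap URLs declared in the given robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when nothing is declared
```

For a file containing the line "Sitemap: https://www.example.com/sitemap_index.xml", the function returns a list with that one URL. An empty list means crawlers have to learn about the Sitemap some other way, such as a submission through the webmaster accounts.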

Using XML Sitemaps seems to have an accelerating effect on the crawl rate:

“At first, the number of visits was stabilized at a rate of 20 to 30 pages per hour. As soon as the sitemap was uploaded through Webmaster Central, the crawler accelerated to approximately 500 pages per hour. In just a few days it reached a peak of 2,224 pages per hour. Where at first the crawler visited 26.59 pages per hour on average, it grew to an average of 1,257.78 pages per hour which is an increase of no less than 4,630.27%”.[15]

Here are some tips for optimizing XML Sitemaps for large websites:

  • Add only URLs that respond with 200 OK. Too many errors and search engines will stop trusting your Sitemaps. Bing has,

“a 1% allowance for dirt in a Sitemap. Examples of dirt are if we click on a URL and we see a redirect, a 404 or a 500 code. If we see more than a 1% level of dirt, we begin losing trust in the Sitemap”.[16]

Google is less stringent than Bing; they do not care about the errors in the Sitemap.

  • Have no links to duplicate content and no URLs that canonicalize to different URLs—only to “end state” URLs.
  • Place videos, images, news, and mobile URLs in separate Sitemaps. For videos, you can use video sitemaps, but mRSS formatting is supported as well.
  • Segment the Sitemaps by topic or category, and by subtopic or subcategory. For example, you can have a sitemap for your camping category – sitemap_camping.xml, another one for your Bicycles category – sitemap_cycle.xml, and another one for the Running Shoes category – sitemap_run.xml. This segmentation does not directly improve organic rankings, but it will help identify indexation issues at granular levels.
  • Create separate Sitemap files for product pages — segment by the lowest level of categorization.
  • Fix Sitemap errors before submitting your files to search engines. You can do this within your Google Search Console account, using the Test Sitemap feature:

Figure 103 – The Test Sitemap feature in Google Search Console.

  • Keep language-specific URLs in separate Sitemaps.
  • Do not assign the same weight to all pages (your scoring can be based on update frequency or other business rules).
  • Auto-update the Sitemaps whenever important URLs are created.
  • Include only URLs that contain essential and important filters (see section Product Detail Pages).

You probably noticed a commonality within these tips: segmentation. It is a good idea to split your XML files as much as you can without overdoing it (e.g., just 10 URLs per file), so you can identify and fix indexation issues more easily.[17]
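Two of these tips, listing only URLs that respond with 200 OK and segmenting by category, can be combined in a short generator. The Python sketch below is a minimal illustration: the function name is hypothetical, and the input format (a URL, a category label, and the HTTP status you recorded for it) is an assumption about the data you have from your own crawl.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_segmented_sitemaps(pages):
    """pages: iterable of (url, category, http_status). Returns {filename: xml}."""
    by_category = defaultdict(list)
    for url, category, status in pages:
        if status == 200:  # leave out redirects and errors
            by_category[category].append(url)

    sitemaps = {}
    for category, urls in by_category.items():
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in sorted(urls):
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        sitemaps[f"sitemap_{category}.xml"] = ET.tostring(urlset, encoding="unicode")
    return sitemaps
```

Each returned string still needs an XML declaration and UTF-8 encoding when it is written to disk. With one file per category, the indexation numbers reported for each submitted file map directly to a section of the catalog.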

Keep in mind that sitemaps, either XML or HTML, should not be used as a substitute for poor website architecture or other crawlability issues, but only as a backup. Make sure that there are other paths for crawlers to reach all important pages on your website (e.g., internal contextual links).

Here are some factors that can influence the crawl budget:
Popularity
Crawlers will request pages more frequently if they find more external and internal links pointing to them. Most ecommerce websites experience challenges building links to category and product detail pages, but this has to be done. Guest posting, giveaways, link bait, evergreen content, outright link requests within confirmation emails, ambassador programs, and perpetual holiday category pages are just some of the tactics that can help with link development.

Crawl rate settings
You can alter (usually decrease) the crawl rate of Googlebot using your Google Search Console account. However, changing the rate is not advisable unless the crawler slows down your web server.
With Bing’s Crawl Control feature you can even set up day parting.

Figure 104 – Bing’s Crawl Control Interface.

Fresh content
Updating content on pages and then pinging search engines (i.e., by creating feeds for product and category pages) should get the crawlers to the updated content relatively quickly.

If you update fewer than 300 URLs per month, you can use the Fetch as Google feature inside your Google Search Console account to get the updated URLs re-crawled in a snap. Also, you can regularly (e.g., weekly) create and submit a new XML Sitemap just for the updated or for the new pages.

There are several ways to keep your content fresh. For example, you can include an excerpt of about 100 words from related blog posts on product detail pages. Ideally, the excerpt should include the product name and links to parent category pages. Every time you mention a product in a new blog post, update the excerpt on the product detail page as well.

You can even include excerpts from articles that do not directly mention the product name if the article is related to the category in which the product can be classified.

Figure 105 – The “From Our Blog” section keeps this page updated and fresh.

Another great tactic to keep the content fresh is to continuously generate user reviews, product questions and answers, or other forms of user-generated content.

Figure 106 – Ratings and reviews are a smart way to keep pages updated, especially for products in high demand.

Domain authority
The higher your website’s domain authority, the more visits search engine crawlers will pay. Your domain authority increases as more external links point to your website; this is a lot easier said than done.

RSS feeds
RSS feeds are one of the fastest ways to notify search engines of new products, categories, or other types of fresh content on your website. Here’s what Duane Forrester (formerly Bing’s Webmaster senior product manager) said in the past about RSS feeds:

“Things like RSS are going to become a desired way for us to find content … It is a dramatic cost savings for us”.[18]

You can get search engines to crawl the new content within minutes of publication with the help of RSS. For example, if you write content that supports category and product detail pages and if you link smartly from these supporting pages, search engines will request and crawl the linked-to product and category URLs as well.

Figure 107 – Zappos has an RSS feed for brand pages. Users (and search engines) are instantly notified every time Zappos adds a new product from a brand.
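A feed like this can be produced with very little code. The following Python sketch builds a minimal RSS 2.0 document for newly added products; the function name and the item keys (title, link, published) are assumptions made for the example, and a production feed would normally carry more fields.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, description, items):
    """Build a minimal RSS 2.0 feed; each item needs title, link, and published."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        ET.SubElement(node, "guid").text = item["link"]  # the URL doubles as the ID
        ET.SubElement(node, "pubDate").text = format_datetime(item["published"])
    return ET.tostring(rss, encoding="unicode")


# Example item: a product added to a brand page (placeholder URL).
new_product = {
    "title": "Trail Running Shoe",
    "link": "https://www.example.com/run/trail-running-shoe",
    "published": datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc),
}
feed = build_rss("New products", "https://www.example.com/", "Latest additions", [new_product])
```

Regenerating the feed whenever a product is added to a brand or category keeps subscribers, including search engines, pointed at the newest URLs.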

Guiding crawlers

The best way to avoid wasting crawl budget on low-value-add URLs is to avoid creating links to those URLs in the first place. However, that is not always an option. For example, you have to allow people to filter products based on three or more product attributes. Alternatively, you may want to allow users to email a friend from product detail pages. Or, you have to give users the option to write product reviews. If you create unique URLs for “Email to a Friend” links, for example, you may end up creating duplicate content.

Figure 108 – The URLs in the image above are near-duplicates. However, these URLs do not have to be accessible to search engines. Block the email-friend.php file in robots.txt.

These “Email to a Friend” URLs will most likely lead to the same web form, and search engines will unnecessarily request and crawl hundreds or thousands of such links, depending on the size of your catalog. You will waste the crawl budget by allowing search engines to discover and crawl these URLs.

You should control which links are discoverable by search engine crawlers and which are not. The more unnecessary requests for junk pages a crawler makes, the fewer chances it has to reach more important URLs.

Crawler directives can be defined at various levels, in this priority:

  • Site-level, using robots.txt.
  • Page-level, with the noindex meta tag and with HTTP headers.
  • Element-level, using the rel=“nofollow” link attribute.

Site-level directives overrule page-level directives, and page-level directives overrule element-level directives. It is important to understand this priority because for a page-level directive to be discovered and followed, the site-level directives should allow access to that page. The same applies to element-level and page-level directives.

On a side note, if you want to keep content as private as possible, one of the best ways is to use server-side authentication to protect areas.

Robots.txt

Although robots.txt files can be used to control crawler access, the URLs disallowed with robots.txt may still end up in search engines’ indices because of external backlinks pointing to the “robotted” URLs. This suggests that URLs blocked with robots.txt can accumulate PageRank. However, URLs blocked with robots.txt will not pass PageRank, since search engines cannot crawl and index the content and the links on such pages. The exception is if the URLs were previously indexed, in which case they will pass PageRank.

It is interesting to note that pages with Google+ buttons may be visited by Google when someone clicks on the plus button, ignoring the robots.txt directives.[19]

One of the biggest misconceptions about robots.txt is that it can be used to control duplicate content. The fact is, there are better methods for controlling duplicate content, and robots.txt should only be used to control crawler access. That being said, there may be cases where one does not have control over how the content management system generates the content, or cases when one cannot make changes to pages generated on the fly. In such situations, one can try as a last resort to control duplicate content with robots.txt.

Every ecommerce website is unique, with its own specific business needs and requirements, so there is no general rule for what should be crawled and what should not. Regardless of your website particularities, you will need to manage duplicate content by either using rel=“canonical” or HTTP headers.

While tier-one search engines will not attempt to “add to cart” and will not start a checkout process or a newsletter sign-up on purpose, coding glitches may trigger them to attempt to access unwanted URLs. Considering this, here are some common types of URLs you can block access to:

Shopping cart and checkout pages
Add to Cart, View Cart, and other checkout URLs can safely be added to robots.txt.

If the View Cart URL is mysite.com/viewcart.aspx, you can use the following commands to disallow crawling:

User-agent: *
# Do not crawl view cart URLs
Disallow: /*viewcart.aspx
# Do not crawl add to cart URLs
Disallow: /*addtocart.aspx
# Do not crawl checkout URLs
Disallow: /checkout/

The above directives mean that all bots are forbidden to crawl any URL that contains viewcart.aspx or addtocart.aspx. Also, all the URLs under the /checkout/ directory are off-limits.

Robots.txt does not support full regular expressions, but the major search engines honor two wildcard characters in URL patterns, so your programmers should be able to cover a large spectrum of URLs. The star symbol means “any sequence of characters”, and the dollar sign means “ends with”. There is no caret sign: every pattern is matched from the start of the URL path.

User account pages
Account URLs such as Account Login can be blocked as well:
User-agent: *
# Do not crawl login URLs
Disallow: /store/account/*.aspx$

The above directive means that all URLs under the /store/account/ directory that end with .aspx will not be crawled.

Below are some other types of URLs that you can consider blocking.

Figure 109 – These are some other types of pages that you can consider blocking.

A couple of notes about the resources highlighted in yellow:

  • If you are running an ecommerce website on WordPress, you may want to let search engine bots crawl the URLs under the tag directory; there were times when you had to block the tag pages, but not anymore.
  • The /includes/ directory should not contain scripts that are used for rendering content on pages. Block it only if you host the scripts necessary to create the undiscoverable links inside /includes/.
  • The same for the /scripts/ and /libs/ directories – do not block them if they contain resources necessary for rendering content.

Duplicate or near duplicate content issues such as pagination and sorting are not optimally addressed with robots.txt.
Before you upload the robots.txt file, I recommend testing it against your existing URLs. First, generate the list of URLs on your website using one of the following methods:

  • Ask for help from your programmers.
  • Crawl the entire website with your favorite crawler.
  • Use weblog files.

Then, open this list in a text editor that allows searching by regular expressions. Software like RegexBuddy, RegexPal or Notepad++ are good choices. You can test the patterns you used in the robots.txt file using these tools, but keep in mind that you might need to slightly rewrite the regex pattern you used in the robots.txt, depending on the software you use.

Let’s say that you want to block crawlers’ access to email landing pages, which are all located under the /ads/ directory. Your robots.txt will include these lines:

User-agent: *
# Do not crawl email landing pages
Disallow: /ads/
Using RegexPal, you can test the URL list using this simple regex: /ads/

Figure 110 – RegexPal automatically highlights the matched pattern.

If you work with large files that contain hundreds of thousands of URLs, use Notepad++ to match URLs with regular expressions, because Notepad++ can easily handle large files.

For example, let’s say that you want to block all URLs that end with .js. The robots.txt will include this line:

Disallow: /*.js$
To find which URLs in your list match the robots.txt directives using Notepad++, input “\.js$” in the “Find what” field (the trailing dollar sign keeps the “ends with” condition) and then select the Regular expression Search Mode:

Figure 111 – Regular expression search mode in Notepad++

Skimming through the matching URLs, which are highlighted in yellow, can clear up doubts about which URLs will be excluded with robots.txt.
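If you prefer a script to a text editor, the same check can be sketched in Python. This is a minimal illustration, not a full robots.txt parser: it only handles Disallow patterns with the * and $ wildcards, it ignores Allow rules and rule precedence, and it expects URL paths rather than full URLs. The function names and the sample paths are assumptions made for this example.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Rules always match from the start of the path, hence the leading "^".
    return re.compile("^" + body + ("$" if anchored_end else ""))

def blocked_urls(paths, disallow_patterns):
    """Return the paths matched by at least one Disallow pattern."""
    compiled = [robots_pattern_to_regex(p) for p in disallow_patterns]
    return [path for path in paths if any(rx.match(path) for rx in compiled)]

# Hypothetical URL paths, for illustration only.
sample_paths = [
    "/ads/spring-sale.html",
    "/blog/ads/why-ads-work",
    "/scripts/app.js",
    "/scripts/app.js?v=2",
    "/store/account/login.aspx",
]
rules = ["/ads/", "/*.js$", "/store/account/*.aspx$"]
print(blocked_urls(sample_paths, rules))
```

Note that /blog/ads/why-ads-work is not reported as blocked, because the /ads/ rule only matches at the start of the path, and /scripts/app.js?v=2 is not reported either, because the $ sign requires the URL to end with .js.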

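When you need to keep media such as videos, images or .pdf files out of the search results, use the X-Robots-Tag HTTP header[20] instead of the robots.txt file. Keep in mind that the crawler has to fetch the file to see this header, so the file must not be blocked with robots.txt.
However, remember that if you want to address duplicate content issues for non-HTML documents, you should use rel=“canonical” HTTP headers.[21]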
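As a rough sketch, the two HTTP headers mentioned above could be assembled like this for a PDF response. The function name and the sample URL are hypothetical, and how you attach the headers depends on your web server or framework; only the header names and value formats follow the references cited above.

```python
def headers_for_pdf(canonical_url=None, allow_indexing=True):
    """Build response headers for a PDF file (illustrative helper, not a library API)."""
    headers = {"Content-Type": "application/pdf"}
    if canonical_url:
        # rel="canonical" expressed as an HTTP Link header for a non-HTML document.
        headers["Link"] = '<{}>; rel="canonical"'.format(canonical_url)
    if not allow_indexing:
        # Indexing directive sent as an HTTP header instead of a meta tag.
        headers["X-Robots-Tag"] = "noindex"
    return headers

# Hypothetical URL, for illustration only.
print(headers_for_pdf("https://mysite.com/downloads/charger-manual.pdf"))
print(headers_for_pdf(allow_indexing=False))
```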

The exclusion parameter

With this technique, you selectively add a parameter (e.g., crawler=no) or a string (e.g., ABCD-9) to the URLs that you want to be inaccessible, and then you block that parameter or string with robots.txt.

First, decide which URLs you want to block.

Let’s say that you want to control the crawling of the faceted navigation by not allowing search engines to crawl URLs generated when applying more than one filter value within the same filter (also known as multi-select). In this case, you will add the crawler=no parameter to all URLs generated when a second filter value is selected on the same filter.

If you want to block bots when they try to crawl a URL generated by applying more than two filter values on different filters, you will add the crawler=no parameter to all URLs generated when a third filter value is selected, no matter which options were chosen or the order in which they were chosen. Here’s a scenario for this example:

The crawler is on the Battery Chargers subcategory page.
The hierarchy is: Home > Accessories > Battery Chargers
The page URL is: mysite.com/accessories/motorcycle-battery-chargers/

Then, the crawler “checks” one of the Brands filter values, Noco. This is the first filter value, and therefore you will let the crawler fetch that page.
The URL for this selection does not contain the exclusion parameter:
mysite.com/accessories/motorcycle-battery-chargers?brand=noco

The crawler now checks one of the Style filter values, cables. Since this is the second filter value applied, you will still let the crawler access the URL.
The URL still does not contain the exclusion parameter. It contains just the brand and style parameters:
mysite.com/accessories/motorcycle-battery-chargers?brand=noco&style=cables

Now, the crawler “selects” one of the Pricing filter values, the number 1. Since this is the third filter value, you will append the crawler=no to the URL.
The URL becomes:
mysite.com/accessories/motorcycle-battery-chargers?brand=noco&style=cables&pricing=1&crawler=no

If you want to block the URL above, the robots.txt file will contain:
User-agent: *
Disallow: /*crawler=no

The method described above prevents the crawling of facet URLs when more than two filter values have been applied, but it does not allow specific control over which filters are going to be crawled and which ones are not. For example, if the crawler “checks” the Pricing options first, the URL containing the pricing parameter will be crawled. We will discuss faceted navigation in detail later on.
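The URL-building rule from the scenario above can be sketched as follows. This is an illustration only: the function name and the MAX_CRAWLABLE_FILTERS constant are assumptions, and the sketch counts filter values across different filters, so it does not cover the multi-select case on a single filter.

```python
from urllib.parse import urlencode

# Assumption taken from the scenario above: two filter values stay crawlable.
MAX_CRAWLABLE_FILTERS = 2

def facet_url(base_path, selected_filters):
    """Build a facet URL, appending crawler=no once too many filter values are applied.

    selected_filters is a list of (parameter, value) pairs in selection order.
    """
    params = list(selected_filters)
    if len(params) > MAX_CRAWLABLE_FILTERS:
        params.append(("crawler", "no"))
    if not params:
        return base_path
    return base_path + "?" + urlencode(params)

base = "/accessories/motorcycle-battery-chargers"
print(facet_url(base, [("brand", "noco")]))
print(facet_url(base, [("brand", "noco"), ("style", "cables")]))
print(facet_url(base, [("brand", "noco"), ("style", "cables"), ("pricing", "1")]))
```

Only the third URL carries the crawler=no parameter, so it is the only one matched by the Disallow: /*crawler=no rule.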

URL parameters handling

URL parameters can cause crawl efficiency problems as well as duplicate content issues. For example, if you implement sorting, filtering, and pagination with parameters, then you are likely to end up with a large number of URLs, which will waste crawl budget. In a video about parameters handling, Google shows[22] how 158 products on googlestore.com generated an astonishing 380,000 URLs for crawlers.
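To see how quickly parameters multiply, here is a back-of-the-envelope sketch. All the numbers are hypothetical and are not taken from the Google video; the sketch assumes single-select filters and a fixed parameter order, so real websites that allow multi-select or any parameter order will generate even more URLs.

```python
from math import prod

def count_listing_urls(filter_value_counts, sort_options, page_sizes):
    """Count the URL variants of one category listing page."""
    # Each single-select filter is either unset or set to one of its values.
    filter_combinations = prod(n + 1 for n in filter_value_counts)
    return filter_combinations * sort_options * page_sizes

# Hypothetical category: four filters with 10, 8, 5 and 6 values,
# four sort orders and three "items per page" settings.
print(count_listing_urls([10, 8, 5, 6], sort_options=4, page_sizes=3))
```

With these made-up numbers, a single category page already produces 49,896 URL variants, before pagination is even considered.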

Controlling URL parameters within Google Search Console and Bing Webmaster Tools can improve crawl efficiency, but it will not address the causes of duplicate content. You will still need to fix canonicalization issues, at the source. However, since ecommerce websites use multiple URL parameters, controlling them correctly with webmaster tools may prove tricky and risky. Unless you know what you are doing, you are better off using either a conservative setup or the default settings.

URL parameters handling is mostly used for deciding which pages to index and which page to canonicalize to.

One advantage of handling URL parameters within webmaster accounts is that page-level directives (i.e., rel=“canonical” or meta noindex) will still apply as long as the pages containing such directives are not blocked with robots.txt or with other methods. However, while it is possible to use limited regular expressions within robots.txt to prevent the crawling of URLs with parameters, robots.txt will overrule page-level and element-level directives.

Figure 112 – A Google Search Console notification regarding URL parameters.

Sometimes there are cases where you do not have to play with the URL parameters settings. In this screenshot, you can see a message saying that Google has no issues with categorizing your URL parameters. If Google can crawl the entire website without difficulty, you can leave the default settings as they are. If you want to set up the parameters, click on the Configure URL parameters link.

Figure 113 – This screenshot is for an ecommerce website with fewer than 1,000 SKUs. You can see how the left navigation generated millions of URLs.

In the previous screenshot, the limit key (used for changing the number of items listed on the category listing page) generated 6.6 million URLs when combined with other possible parameters. However, because this website has strong authority, it gets a lot of attention and love from Googlebot, and it does not have crawling or indexing issues.

When handling parameters, the first thing you want to decide is which parameters change the content (active parameters) and which ones do not (passive parameters). It is best to do this with your programmers because they know best how the parameters are used. Parameters that do not affect how content is displayed on a page (e.g., user tracking parameters) are a safe target for exclusion.

Although Google by itself does a good job at identifying parameters that do not change content, it is still worthwhile to set them manually.

To change the settings for such parameters, click Edit:

Figure 114 – Controlling URL parameters within Google Search Console.

In our example, the parameter utm_campaign was used to track the performance of internal promotions, and it does not change the content on the page. In this scenario, choose “No: Does not affect page content (ex: track usage)”.

Figure 115 – Urchin Tracking Module parameters (widely known as UTMs) can safely be consolidated to the representative URLs.

To make sure you are not blocking the wrong parameters, test the sample URLs by loading them in the browser. Load the URL and see what happens if you remove the tracking parameters. If the content does not change, then it can be safely excluded.
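If you have many sample URLs to check, a short script can produce the “clean” version of each URL for you to compare in the browser. The following is a minimal sketch: the function name is an assumption, the sample URL is made up, and the set of parameters only covers the standard UTM names, so extend it with your own tracking parameters.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Standard UTM names; add your own tracking parameters here.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_term", "utm_content", "utm_campaign"}

def strip_tracking_params(url):
    """Return the URL without tracking parameters, keeping every other parameter."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Hypothetical URL, for illustration only.
print(strip_tracking_params("https://mysite.com/accessories/?utm_campaign=spring&brand=noco"))
```

Load both the original and the stripped URL; if the content is the same, the removed parameters are passive.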

On a side note, tracking internal promotions with UTM parameters is not ideal. UTM parameters are designed for tracking campaigns outside your website. If you want to track the performance of your internal marketing banners, then use other parameter names or use event tracking.

Some other common parameters that you may consider for exclusion are session IDs, UTM tracking parameters (utm_source, utm_medium, utm_term, utm_content, and utm_campaign) and affiliate IDs.
A word of caution is necessary here, and this recommendation comes straight from Google.[23]

“Configuring site-wide parameters may have severe, unintended effects on how Google crawls and indexes your pages. For example, imagine an ecommerce website that uses storeID in both the store locator and to look up a product’s availability in a store:
/store-locator?storeID=123
/product/foo-widget?storeID=123
If you configure storeID to not be crawled, both the /store-locator and /foo-widget paths will be affected. As a result, Google may not be able to index both kind of URLs, nor show them in our search results. If these parameters are used for different purposes, we recommend using different parameter names”.

In the scenario above, you can keep the store location in a cookie.

Things get more complicated when parameters change how the content is displayed on a page.

One safe setup for content-changing parameters is to suggest to Google how the parameter affects the page (e.g., sorts, narrows/filters, specifies, translates, paginates, others), and use the default option Let Google decide. This approach will allow Google to crawl all the URLs that include the targeted parameter.

Figure 116 – A safe setup is to let Google know that a parameter changes the content, and let Google decide what to do with the parameter.

In the previous example, I knew that the mid parameter changes the content on the page, so I pointed out to Google that the parameter sorts items. However, when it came to deciding which URLs to crawl, I let Google do it.

I recommend letting Google decide because of the way Google chooses canonical URLs: it groups duplicate content URLs into clusters based on internal linking (PageRank), external link popularity, and content. Then Google finds the best URL to surface in search results for each cluster of duplicate content. Since Google does not share the complete link graph of your website, you will not know which URLs are linked the most, so you may not always be able to choose the right URL to canonicalize to.

  1. Google Patent On Anchor Text And Different Crawling Rates, http://www.seobythesea.com/2007/12/google-patent-on-anchor-text-and-different-crawling-rates/
  2. Large-scale Incremental Processing Using Distributed Transactions and Notifications, http://research.google.com/pubs/pub36726.html
  3. Our new search index: Caffeine, http://googleblog.blogspot.ca/2010/06/our-new-search-index-caffeine.html
  4. Web crawler, http://en.wikipedia.org/wiki/Web_crawler#Politeness_policy
  5. To infinity and beyond? No!, http://googlewebmastercentral.blogspot.ca/2008/08/to-infinity-and-beyond-no.html
  6. Crawl Errors: The Next Generation, http://googlewebmastercentral.blogspot.ca/2012/03/crawl-errors-next-generation.html
  7. Make Data Useful, http://www.scribd.com/doc/4970486/Make-Data-Useful-by-Greg-Linden-Amazon-com
  8. Shopzilla’s Site Redo – You Get What You Measure, http://www.scribd.com/doc/16877317/Shopzilla-s-Site-Redo-You-Get-What-You-Measure
  9. Expires Headers for SEO: Why You Should Think Twice Before Using Them, http://moz.com/ugc/expires-headers-for-seo-why-you-should-think-twice-before-using-them
  10. How Website Speed Actually Impacts Search Ranking, http://moz.com/blog/how-website-speed-actually-impacts-search-ranking
  11. Optimizing your very large site for search — Part 2, http://web.archive.org/web/20140527160343/http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2009/01/27/optimizing-your-very-large-site-for-search-part-2.aspx
  12. Matt Cutts Interviewed by Eric Enge, http://www.stonetemple.com/articles/interview-matt-cutts-012510.shtml
  13. Save bandwidth costs: Dynamic pages can support If-Modified-Since too, http://sebastians-pamphlets.com/dynamic-pages-can-support-if-modified-since-too/
  14. Site Map Usability, http://www.nngroup.com/articles/site-map-usability/
  15. New Insights into Googlebot, http://moz.com/blog/googlebot-new-insights
  16. How Bing Uses CTR in Ranking, and more with Duane Forrester, http://www.stonetemple.com/search-algorithms-and-bing-webmaster-tools-with-duane-forrester/
  17. Multiple XML Sitemaps: Increased Indexation and Traffic, http://moz.com/blog/multiple-xml-sitemaps-increased-indexation-and-traffic
  18. How Bing Uses CTR in Ranking, and more with Duane Forrester, http://www.stonetemple.com/search-algorithms-and-bing-webmaster-tools-with-duane-forrester/
  19. How does Google treat +1 against robots.txt, meta noindex or redirected URL, https://productforums.google.com/forum/#!msg/webmasters/ck15w-1UHSk/0jpaBsaEG3EJ
  20. Robots meta tag and X-Robots-Tag HTTP header specifications, https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
  21. Supporting rel=”canonical” HTTP Headers, http://googlewebmastercentral.blogspot.ca/2011/06/supporting-relcanonical-http-headers.html
  22. Configuring URL Parameters in Webmaster Tools, https://www.youtube.com/watch?v=DiEYcBZ36po&feature=youtu.be&t=1m50s
  23. URL parameters, https://support.google.com/webmasters/answer/1235687?hl=en