Ecommerce SEO

CHAPTER 4

Crawl Optimization

Length: 6,918 words

Estimated reading time: 50 minutes

This e-commerce SEO guide has almost 400 pages of advanced, actionable insights into on-page SEO for e-commerce. This is the fourth of eight chapters.

Written by an e-commerce SEO consultant with over 25 years of research and practical experience, this comprehensive SEO resource will teach you how to identify and address all SEO issues specific to e-commerce websites in one place.

The strategies and tactics described in this guide have been successfully implemented for top-10 online retailers, small and medium businesses, and mom-and-pop stores.

Please share and link to this guide if you like it.

Crawl Optimization

Crawl optimization aims to help search engines discover URLs efficiently. Relevant pages should be easy to reach, while less important pages should not waste the so-called “crawl budget” and should not create crawl traps. The crawl budget is defined as the number of URLs search engines can and want to crawl.

Search engines assign a crawl budget to each website, depending on the authority of the website. Generally, a site’s authority is roughly proportional to its PageRank.

The crawl budget concept is essential for e-commerce websites because they usually comprise a vast number of URLs—from tens of thousands to millions.

Suppose the technical architecture puts the search engine crawlers (robots, bots, or spiders) in infinite loops or traps. In that case, the crawl budget will be wasted on unimportant pages for users or search engines, which may leave important pages out of search engines’ indices.

Additionally, crawl optimization is where very large websites can take advantage of the opportunity to have more critical pages indexed and low PageRank pages crawled more frequently.[1]

The number of URLs Google can index increased dramatically after introducing their Percolator[2] architecture (with the “Caffeine” update[3] ). However, it is still important to check what resources search engine bots request on your website and to prioritize crawling accordingly.

Before we begin, it is important to understand that crawling and indexing are different processes. Crawling means just fetching files from websites. Indexing means analyzing the files and deciding whether they are worthy of inclusion. So, even if search engines crawl a page, they will not necessarily index it.

Crawling is influenced by several factors, such as the website’s structure, internal linking, domain authority, URL accessibility, content freshness, update frequency, and the crawl rate settings in webmaster tools accounts.

Before detailing these factors, let’s discuss tracking and monitoring search engine bots.

Tracking and monitoring bots

Googlebot, Yahoo! Slurp, and Bingbot are polite bots,[4] which means they obey the crawling directives found in robots.txt files before requesting resources from your website. Polite bots identify themselves to the web server, so you can control their access. The requests made by bots are stored in your log files and are available for analysis.

Webmaster tools, such as the ones provided by Google and Bing, only uncover a small part of what bots do on your website—e.g., how many pages they crawl or bandwidth usage data. That is useful in some ways but is not enough.

For deeper insights, you need to analyze the server log files. From there, you can extract information that helps identify large-scale issues.

Log file analysis was traditionally performed with the grep command-line tool and regular expressions, but desktop and web-based tools now make this analysis easier and more accessible to marketers.

On ecommerce websites, monthly log files are usually huge—gigabytes or even terabytes of data. However, you do not need all the data inside the log files to be able to track and monitor search engine bots. You only need the lines generated by bot requests. This way, you can significantly reduce the size of the log files from gigabytes to megabytes.

The following Linux command (case sensitive) extracts just the lines containing “Googlebot” from one log file (access_log.processed) into another (googlebot.log):
grep "Googlebot" access_log.processed > googlebot.log

To extract similar data for Bing and other search engines, replace “Googlebot” with other bot names.

Figure 86 – The log file was reduced from 162.5 MB to 1.4 MB.
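
To capture several engines in a single pass, here is a minimal sketch (adjust the bot names to the engines you care about; bots.log is a hypothetical output file name):

grep -E 'Googlebot|bingbot|Slurp' access_log.processed > bots.log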

Open the bot-specific log file in Excel, go to Data → Text to Columns, and use Delimited with Space to bring the log file data into a table format like this one:

Figure 87 – Filtering by the Status column returns a list of all 404 Not Found errors encountered by Googlebot.

Note: you can import only up to one million rows in Excel; if you need to import more, use MS Access or Notepad++.

To quickly identify crawling issues at category page levels, chart the Googlebot hits for each category. This is where the advantage of category-based navigation and URL structure comes in handy.

Figure 88 – The /bracelets/ directory needs some investigation because there are too few bot requests compared to the other directories.
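
To get the raw counts behind a chart like this, you can tally Googlebot hits per top-level directory straight from the filtered log file. A minimal sketch, assuming the requested path is the seventh whitespace-separated field, as in the common Apache log format:

awk '{ split($7, part, "/"); dir = "/" part[2] "/"; hits[dir]++ } END { for (d in hits) print hits[d], d }' googlebot.log | sort -rn

Each output line shows the number of Googlebot requests followed by the directory (e.g., /bracelets/).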

By pivoting the log file data by URLs and crawl date, you can identify content that gets crawled less often:

Figure 89 – The dates the URLs have been fetched.

This pivot table shows that although the three URLs are positioned at the same level in the hierarchy, URL number three gets crawled much more often than the other two. This is a sign that URL #3 is deemed more important.

Figure 90 – More external backlinks and social media mentions may increase crawl frequency.

Here are some issues and ideas you should consider when analyzing bot behavior using log files:

  • Analyze server response errors and identify what generates those errors.
  • Discover unnecessarily crawled pages and crawling traps.
  • Correlate days since the last crawl with rankings; when you make changes to a page, make sure it gets re-crawled; otherwise, the updates will not be considered for rankings.
  • Discover whether products listed at the top of listings are crawled more often than products listed on component pages (paginated listings). Consider moving the most important products to the first page rather than leaving them on component pages.
  • Check the frequency and depth of the crawl.

The goal of tracking bots is to:

  • Establish where the crawl budget is used.
  • Identify unnecessary requests (e.g., “Write a Review” links that open pages with exactly the same content except for the product name, e.g., mysite.com/review.php?pid=1, mysite.com/review.php?pid=2, and so on).
  • Fix the leaks.

Instead of wasting budget on unwanted URLs (e.g., duplicate content URLs), focus on sending crawlers to pages that matter to you and your users.

Another useful application of log files is to evaluate the quality of backlinks. Rent links from various external websites and point them at pages with no other backlinks (product detail pages or pages that support product detail pages). Then, analyze the spider activity on those pages. If the crawl frequency increases, that link is more valuable than a link that does not increase spider activity. An increase in crawl frequency on your pages suggests that the linking page itself is crawled often, which means it has good authority. Once you have identified good opportunities, work to get natural links from those websites.

Flat website structure

Suppose there are no other technical impediments to crawling large websites (e.g., crawlable facets or infinite spaces[5]). In that case, a flat website architecture can help crawling by allowing search engines to reach deep pages in very few hops, therefore using the crawl budget very efficiently.

Pagination—specifically, de-pagination—is one way to flatten your website architecture. We will discuss pagination later in the Listing Pages section.

For more information on flat website architecture, please refer to the section titled The Concept of Flat Architecture in the Site Architecture section.

Accessibility

I will refer to accessibility in terms of optimization for search engines rather than optimization for users.

Accessibility is a critical factor for crawling. Your crawl budget is dictated by how the server responds to bot traffic. If your website’s technical architecture makes it impossible for search engine bots to access URLs, those URLs will not be indexed. URLs that are already indexed but become inaccessible may, after a few unsuccessful attempts, be removed from search engine indices. Google crawls new websites at a low rate and gradually increases it to a level that does not create accessibility issues for your users or your server.

So, what prevents URLs and content from being accessible?

DNS and connectivity issues
Use http://www.intodns.com/ to check for DNS issues. Everything in red and yellow needs your attention (even if it is an MX record).

Figure 91 – Report from intodns.com.

Using Google and Bing webmaster accounts, fix all the issues related to DNS and connectivity:

Figure 92 – Bing’s Crawl Information report.

Figure 93 – Google’s Site Errors report in the old GSC.[6]

One DNS issue you may want to pay attention to is wildcard DNS records, where any subdomain request resolves and the web server responds with a 200 OK code, even for subdomains that do not exist. Unrecognizable hostnames are an even more severe DNS-related problem (the DNS lookup fails when trying to resolve the domain name).

One large retailer had another misconfiguration: its US (.com) and UK (.co.uk) domains resolved to the same IP. If you run multiple country-specific domains, host them on different IPs (ideally from within the country you target with each domain), and check how the domain names resolve.

If your web servers are down, no one can access the website (including search engine bots). Server tools like Monitor.Us, Scoutt, or Site24x7 can help you monitor your site’s availability.

Host load
Host load represents the maximum number of simultaneous connections a web server can handle. Every page load request from Googlebot, Yahoo! Slurp, or Bingbot opens a connection with your web server. Since search engines crawl from multiple machines simultaneously, you can theoretically reach the connection limit, and your website may crash (especially if you are on a shared hosting plan).

Use tools like the one found at loadimpact.com to check how many connections your website can handle. But be careful; your site can become unavailable or even crash during such tests.

Figure 94 – If your website loads in under two seconds when used by many visitors, you should be fine. The graph was generated by loadimpact.com.

Page load time
Page load time is not only a crawling factor but also a ranking and usability factor. Amazon reportedly increased its revenue by 1% for every 100 ms of load time improvement,[7] and Shopzilla increased revenue by 7 to 12% by decreasing page load time by five seconds.[8]

There are plenty of articles about page load speed optimization, and they can get pretty technical. Here are a few pointers to summarize how you can optimize load times:

  • Defer loading images until they are needed for display in the browser (lazy loading).
  • Use CSS sprites.
  • Use the HTTP/2 protocol.

Figure 95 – Amazon uses CSS sprites to minimize the number of requests to their server.

Figure 96 – Apple used sprites for their main navigation.

  • Use content delivery networks for media (and other files that do not update often).
  • Implement database and cache (server-side caching) optimization.
  • Enable HTTP compression and implement conditional GET.
  • Optimize images.
  • Use expires headers.[9]
  • Ensure a fast, responsive server to decrease the time to first byte (TTFB). Use http://webpagetest.org/ to measure TTFB, or run the quick command-line check shown after this list. There seems to be a clear correlation between increased TTFB and lower rankings.[10]
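
As a quick spot check, curl can report TTFB for a single URL from the command line (a minimal sketch with a placeholder URL; webpagetest.org remains the better tool for full waterfall analysis):

curl -o /dev/null -s -w "TTFB: %{time_starttransfer}s\n" https://www.example.com/

The time_starttransfer value includes DNS lookup, connection setup, and server response time, so run the command a few times and compare averages.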

If your URLs load slowly, search engines may interpret this as a connectivity issue, meaning they will give up crawling the troubled URLs.

The time Google spends on a page seems to influence the number of pages it crawls. The less time to download a page, the more pages are crawled.

Figure 97 – The correlation between the time spent downloading a page and the pages crawled per day seems apparent in this graph.

Broken links
This is a no-brainer. When your internal links are broken, crawlers cannot find the correct pages. Run a full crawl on the entire website with the crawling tool of your choice and fix all broken URLs. Also, use the webmaster tools provided by search engines to find broken URLs.

HTTP caching with Last-Modified/If-Modified-Since and E-Tag headers
Regarding crawling optimization, “cache” refers to a page stored in a search engine index. Note that caching is a highly technical issue, and improper caching settings may make search engines crawl and index a website chaotically.

When a search engine requests a resource on your website, it first asks your web server for the status of that resource. The server replies with an HTTP header response, and based on that response, the search engine downloads or skips the resource.

Many search engines check whether the resource they request has changed since they last crawled it. If it has, they fetch it again; if not, they skip it. This mechanism is referred to as conditional GET. Bing has confirmed that it uses the If-Modified-Since header,[11] and so has Google.[12]

Below is the header response returned when a newly discovered page that supports the If-Modified-Since header is requested.

Figure 98 – Use the curl command to get the last modified date.
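
A minimal sketch of that first request, using a placeholder URL (the -I flag asks curl for the response headers only):

curl -I https://www.example.com/some-category/some-product/

Look for the Last-Modified line in the response headers.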

When the bot requests the same URL the next time, it adds an If-Modified-Since request header. If the document has not been modified, the server responds with a 304 status code (Not Modified):

Figure 99 – A 304 response header

A request with If-Modified-Since returns 304 Not Modified if the page has not changed. If it has been modified, the response is 200 OK, and the search engine fetches the page again.
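
You can simulate the conditional request yourself; a sketch with a placeholder URL and date:

curl -I -H "If-Modified-Since: Tue, 01 Aug 2023 10:00:00 GMT" https://www.example.com/some-category/some-product/

A 304 response confirms the server supports conditional GET for that URL; a 200 response means the page is reported as changed (or that conditional GET is not supported).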

The ETag header works similarly but is more complicated to handle.

If your ecommerce platform uses personalization or the content on each page changes frequently, implementing HTTP caching may be more challenging, but even dynamic pages can support If-Modified-Since.[13]

Sitemaps

There are two major types of sitemaps: HTML sitemaps and XML Sitemaps. You can also submit URL lists in other formats, such as plain text files, RSS, or mRSS.

If you experience crawling and indexing issues, remember that sitemaps are just a patch for more severe problems such as duplicate content, thin content, or improper internal linking. Creating sitemaps is a good idea, but it will not fix those issues.

HTML sitemaps

HTML sitemaps are a form of secondary navigation. They are usually accessible to humans and bots through a link in the footer at the bottom of the website.

A usability study on various websites, including ecommerce websites, found that people rarely use HTML sitemaps. In 2008, only 7% of the users turned to the sitemap when asked to learn about a site’s structure,[14] down from 27% in 2002. Nowadays, the percentage is probably even lower.

Still, HTML sitemaps are handy for sending crawlers to pages at the lower levels of the website taxonomy and for creating flat internal linking.

Figure 100 – Sample flat architecture.

Here are some optimization tips for HTML sitemaps:

Use segmented sitemaps
When optimizing HTML sitemaps for crawling, it is important to remember that PageRank is divided between all the links on a page. Splitting the HTML sitemap into multiple smaller parts is a good way to create more user- and search-engine-friendly pages on large websites such as ecommerce stores.

Instead of a huge sitemap page that links to almost every page on your website, create a main sitemap index page (e.g., sitemap.html) and link from it to smaller sitemap component pages (sitemap-1.html, sitemap-2.html, etc.).

You can split the HTML sitemaps based on topics, categories, departments, or brands. Start by listing your top categories on the index page. How you split the pages depends on your catalog’s number of categories, subcategories, and products. You can use the “100 links per page” rule below as a guideline but do not get stuck on this number, especially if your website has good authority.

If you have over 100 top-level categories, you should display the first 100 on the sitemap index page and the rest on additional sitemap pages. You can allow users and search engines to navigate the sitemap using previous and next links (e.g., “see more categories”).

If you have fewer than 100 top-level categories in the catalog, you will have room to list several important subcategories as well, as depicted below:

Figure 101 – A clean HTML sitemap example.

The top-level categories in this sitemap are Photography, Computers & Solutions, and Pro Audio. Since this business has a limited number of top-level categories, there is room for several subcategories (Digital Cameras, Laptops, Recording).

Do not link to redirects
The URLs linked from sitemap pages should land crawlers on the final URLs rather than go through URL redirects.

Enrich the sitemaps
Annotating links with extra data is good for users and can provide some context for search engines. You can add data such as product thumbnails, customer ratings, manufacturer names, etc.

These are just some suggestions to make HTML sitemap pages easier for people to read and lighter for crawlers to process. However, the best way to help search engines discover content on your website is to feed them a list of URLs in a supported file format, the most common of which is XML.

XML Sitemaps

Modern e-commerce platforms should auto-generate XML Sitemaps, but often, the default output file is not optimized for crawling and analysis. Therefore, it is important to manually review and optimize the automated output or generate the Sitemaps using your own rules.

Unless you have concerns about competitors spying on your URL structure, it is preferable to include the path of the XML Sitemap file within the robots.txt file.
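
The reference is a single line in robots.txt pointing to the Sitemap’s absolute URL; a minimal example with a placeholder domain and file name:

Sitemap: https://www.example.com/sitemap_index.xml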

Search engines request robots.txt every time they start a new crawling session on your website and check whether it has been modified since the last crawl. If it has not, they use the cached robots.txt file to determine which URLs can be crawled.

If you do not specify the location of your XML Sitemap inside robots.txt, search engines will not know where to find it (unless you submitted it within your webmaster accounts). Submitting to Google Search Console or Bing Webmaster Tools gives access to more insights, such as how many URLs have been submitted, how many are indexed, and which errors are present in the XML file.

Figure 102 – If you have an almost 100% indexation rate, you probably do not need to worry about crawl optimization.

Using XML Sitemaps seems to have an accelerating effect on the crawl rate:

“At first, the number of visits was stabilized at a rate of 20 to 30 pages per hour. As soon as the sitemap was uploaded through Webmaster Central, the crawler accelerated to approximately 500 pages per hour. In just a few days it reached a peak of 2,224 pages per hour. Where at first the crawler visited 26.59 pages per hour on average, it grew to an average of 1,257.78 pages per hour which is an increase of no less than 4,630.27%”.[15]

Here are some tips for optimizing XML Sitemaps for large websites:

  • Add only URLs that respond with 200 OK; too many errors and search engines will stop trusting your Sitemaps. Bing has

“a 1% allowance for dirt in a Sitemap. Examples of dirt are if we click on a URL and we see a redirect, a 404 or a 500 code. If we see more than a 1% level of dirt, we begin losing trust in the Sitemap”.[16]

Google is less stringent than Bing; it does not seem to care about errors in the XML Sitemap.

  • Include no duplicate-content URLs and no URLs that canonicalize to other URLs—link only to “end state” URLs.
  • Place videos, images, news, and mobile URLs in separate Sitemaps. You can use video sitemaps for videos, but mRSS formatting is also supported.
  • Segment the Sitemaps by topic or category and by subtopic or subcategory. For example, you can have a sitemap for your camping category – sitemap_camping.xml, another one for your Bicycles category – sitemap_cycle.xml, and another one for the Running Shoes category – sitemap_run.xml. This segmentation does not directly improve organic rankings, but it will help identify indexation issues at granular levels.
  • Create separate Sitemap files for product pages — segment by the lowest level of categorization.
  • Fix Sitemap errors before submitting your files to search engines. You can do this within your Google Search Console account using the Test Sitemap feature:

Figure 103 – The Test Sitemap feature in Google Search Console.

  • Keep language-specific URLs in separate Sitemaps.
  • Do not assign the same weight to all pages (your scoring can be based on update frequency or other business rules).
  • Auto-update the Sitemaps whenever important URLs are created.
  • Include only URLs that contain essential and important filters (see section Product Detail Pages).

You probably noticed a commonality in these tips: segmentation. It is a good idea to split your XML files as much as you can without overdoing it (e.g., do not go as low as 10 URLs per file), so you can identify and fix indexation issues more easily.[17]

Remember that sitemaps, whether XML or HTML, should not be used as a patch for poor website architecture or other crawling issues, but only as a backup. Ensure there are other paths for crawlers to reach all important pages on your website (e.g., internal contextual links).

Here are some factors that can influence the crawl budget:

Popularity
Crawlers will request pages more frequently if they find more external and internal links pointing to them. Most ecommerce websites experience challenges building links to category and product detail pages, but this has to be done. Guest posting, giveaways, link bait, evergreen content, outright link requests within confirmation emails, ambassador programs, and perpetual holiday category pages are just some of the tactics that can help with link development.

Crawl rate settings
You can alter (usually decrease) the crawl rate of Googlebot using your Google Search Console account. However, changing the rate is not advisable unless the crawler slows down your web server.

With Bing’s Crawl Control feature, you can set up dayparting:

Figure 104 – Bing’s Crawl Control Interface.

Fresh content
Updating content on pages and then pinging search engines (e.g., by creating feeds for product and category pages) should quickly get the crawlers to the updated content.

If you update fewer than 300 URLs per month, you can use the Fetch as Google feature inside your Google Search Console account to get the updated URLs re-crawled quickly. You can also create and submit a new XML sitemap for the updated or new pages regularly (e.g., weekly).
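
Google and Bing have also accepted simple sitemap “ping” requests; a sketch with a placeholder sitemap URL (Bing exposes the same pattern at bing.com/ping):

curl "https://www.google.com/ping?sitemap=https://www.example.com/sitemap-fresh.xml"

This only nudges the crawler to fetch the sitemap; it does not guarantee immediate re-crawling.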

There are several ways to keep your content fresh. For example, you can include an excerpt of about 100 words from related blog posts on product detail pages. Ideally, the excerpt should include the product name and links to parent category pages. Every time you mention a product in a new blog post, update the excerpt of the product detail page, as well.

You can even include excerpts from articles that do not directly mention the product name if the article is related to the category in which the product can be classified.

Figure 105 – The “From Our Blog” section keeps this page updated and fresh.

Another great tactic for keeping the content fresh is continuously generating user reviews, product questions and answers, or other user-generated content.

Figure 106 – Ratings and reviews are a smart way to update pages, especially for high-demand products.

Domain authority
The higher your website’s domain authority, the more often search engine crawlers will visit. Domain authority increases as more external links point to your website—which is much easier said than done.

RSS feeds
RSS feeds are one of the fastest ways to notify search engines of new products, categories, or fresh content on your website. Here’s what Duane Forrester (former senior product manager for Bing Webmaster Tools) said in the past about RSS feeds:

“Things like RSS are going to become a desired way for us to find content … It is a dramatic cost savings for us”.[18]

With the help of RSS, you can get search engines to crawl the new content within minutes of publication. For example, if you write SEO content to support category and product detail pages and link smartly from such supporting pages, search engines will also request and crawl the linked-to product and category URLs.

Figure 107 – Zappos has an RSS feed for brand pages. Users (and search engines) are instantly notified every time Zappos adds a new product from a brand.

Guiding crawlers

The best way to avoid wasting the crawl budget on low-value-added URLs is to avoid creating links to those URLs in the first place. However, that is not always an option. For example, you may need to let people filter products by three or more product attributes, allow users to email a friend from product detail pages, or give users the option to write product reviews.

If you create unique URLs for “Email to a Friend” links, you may generate duplicate content.

Figure 108 – The URLs in the image above are near-duplicates. However, these URLs do not have to be accessible to search engines. Block the email-friend.php file in robots.txt

These “Email to a Friend” URLs will most likely lead to the same web form, and search engines will unnecessarily request and crawl hundreds or thousands of such links, depending on the size of your catalog. You will waste the crawl budget by allowing search engines to discover and crawl these URLs.
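
A minimal robots.txt sketch for this case, reusing the email-friend.php file name from the figure above and assuming the script sits at the site root:

User-agent: *
# Do not crawl "Email to a Friend" URLs
Disallow: /email-friend.php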

It would be best to control which links are discoverable by search engine crawlers and which are not. The more unnecessary requests for junk pages a crawler makes, the fewer chances it has to reach more important URLs.

Crawler directives can be defined at various levels, in this priority:

  • Site-level, using robots.txt.
  • Page-level, with the robots meta tag (e.g., noindex) or the X-Robots-Tag HTTP header.
  • Element-level, using the nofollow attribute on links.

Site-level directives overrule page-level directives, and page-level directives overrule element-level directives. It is important to understand this priority because, for a page-level directive to be discovered and followed, the site-level directives should allow access to that page. The same applies to element-level and page-level directives.

On a side note, if you want to keep content as private as possible, one of the best ways is to use server-side authentication to protect areas.

Robots.txt

Although robots.txt files can assist in controlling crawler access, the URLs disallowed with robots.txt may still end up in search engine indices because of external backlinks pointing to the “robot-ed” URLs. This suggests that URLs blocked with robots.txt can accumulate PageRank. However, URLs blocked with robots.txt will not pass PageRank since search engines cannot crawl and index the content and the links on such pages. The exception is if the URLs were previously indexed, in which case they will pass PageRank.

It is interesting to note that pages with Google+ buttons may be visited by Google when someone clicks the plus button, ignoring the robots.txt directives.[19]

One of the biggest misconceptions about robots.txt is that it can be used to control duplicate content. There are better methods for controlling duplicate content, and robots.txt should only be used to control crawler access. That being said, there may be cases where one does not have control over how the content management system generates the content or cases when one cannot make changes to pages generated on the fly. In such situations, one can try to control duplicate content with robots.txt as a last resort.

Every ecommerce website is unique, with its own specific business needs and requirements, so there is no general rule for what should be crawled and what should not. Regardless of your website’s particularities, you must manage duplicate content using rel=“canonical” annotations or HTTP headers.

While tier-one search engines will not attempt to “add to cart” and will not start a checkout process or a newsletter sign-up on purpose, coding glitches may trigger them to attempt to access unwanted URLs. Considering this, here are some common types of URLs you can block access to:

Shopping cart and checkout pages
Add to Cart, View Cart, and other checkout URLs can safely be added to robots.txt.

If the View Cart URL is mysite.com/viewcart.aspx, you can use the following commands to disallow crawling:

User-agent: *
# Do not crawl view cart URLs
Disallow: *viewcart.aspx
# Do not crawl add to cart URLs
Disallow: *addtocart.aspx
# Do not crawl checkout URLs
Disallow: /checkout/

The above directives mean that all bots are forbidden from crawling any URL that contains viewcart.aspx or addtocart.aspx. Also, all URLs under the /checkout/ directory are off-limits.

Robots.txt allows limited use of wildcards to match URL patterns, so your programmers should be able to cover many URL variations. The asterisk (*) matches any sequence of characters, and the dollar sign ($) marks the end of a URL; patterns otherwise match from the beginning of the URL path.

User account pages
Account URLs such as Account Login can be blocked as well:
User-agent: *
# Do not crawl login URLs
Disallow: /store/account/*.aspx$

The above directive blocks crawling of any URL under the /store/account/ directory that ends in .aspx.

Below are some other types of URLs that you can consider blocking.

Figure 109 – These are other types of pages you can consider blocking.

A couple of notes about the resources highlighted in yellow:

  • If you run e-commerce on WordPress, you may want to let search engine bots crawl the URLs under the tag directory. The recommendation was to block tag pages in the past, but not anymore.
  • The /includes/ directory should not contain scripts required to render page content. Block it only if it hosts scripts used to generate links you deliberately want to keep undiscoverable.
  • The same goes for the /scripts/ and /libs/ directories – do not block them if they contain resources necessary for rendering content.

Duplicate or near-duplicate content issues such as pagination and sorting are not optimally addressed with robots.txt.

Before you upload the robots.txt file, I recommend testing it against your existing URLs. First, generate the list of URLs on your website using one of the following methods:

  • Ask for help from your programmers.
  • Crawl the entire website with your favorite crawler.
  • Use your web server log files.

Then, open this list in a tool that supports searching with regular expressions. RegexBuddy, RegexPal, and Notepad++ are good choices. You can test the patterns you used in the robots.txt file with these tools, but remember that you might need to slightly rewrite a pattern depending on the software you use.

If you want to block crawlers’ access to email landing pages under the /ads/ directory, your robots.txt will include these lines:

User-agent: *
# Do not crawl email landing pages
Disallow: /ads/

Using RegexPal, you can test the URL list using this simple regex: /ads/

Figure 110 – RegexPal automatically highlights the matched pattern.

If you work with large files that contain hundreds of thousands of URLs, use Notepad++ to match URLs with regular expressions because Notepad++ can easily handle large files.
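
The same checks can also be run from the command line with grep, which handles large files easily; a minimal sketch, assuming the URL list is saved as urls.txt (a hypothetical file name):

grep '/ads/' urls.txt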

For example, let’s say that you want to block all URLs that end with .js. The robots.txt will include this line:

Disallow: /*.js$

To find which URLs in your list match this directive using Notepad++, enter “\.js$” in the “Find what” field and select the Regular expression Search Mode:

Figure 111 – Regular expression search mode in Notepad++.

Skimming through the matching URLs highlighted in yellow can clear up any doubts about which URLs will be excluded with robots.txt.

When blocking crawlers from accessing media such as videos, images, or .pdf files, use the X-Robots-Tag HTTP header[20] instead of the robots.txt file.
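
For example, to keep PDF files out of the index, the response for each .pdf could include this header (a minimal sketch; how you add it depends on your web server configuration):

X-Robots-Tag: noindex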

However, remember that if you want to address duplicate content issues for non-HTML documents, you should use rel=“canonical” HTTP headers.[21]
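
Such a header is sent with the non-HTML response and points to the preferred URL; a sketch with a placeholder URL, assuming a PDF that duplicates an HTML page:

Link: <https://www.example.com/white-paper/>; rel="canonical"

The header is added to the response for the duplicate PDF URL, consolidating signals to the HTML version.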

The exclusion parameter

With this technique, you selectively add a parameter (e.g., crawler=no) or a string (e.g., ABCD-9) to the URLs you want to be inaccessible, and then you block that parameter or string with robots.txt.

First, decide which URLs you want to block.

Let’s say that you want to control the crawling of the faceted navigation by not allowing search engines to crawl URLs generated when applying more than one filter value within the same filter (also known as multi-select). In this case, you will add the crawler=no parameter to all URLs generated when a second filter value is selected on the same filter.

Suppose you want to block bots when they try to crawl a URL generated by applying more than two filter values on different filters. In that case, you will add the crawler=no parameter to all URLs generated when a third filter value is selected, no matter which options were chosen or the order they were chosen. Here’s a scenario for this example:

The crawler is on the Battery Chargers subcategory page.
The hierarchy is: Home > Accessories > Battery Chargers
The page URL is: mysite.com/accessories/motorcycle-battery-chargers/

Then, the crawler “checks” one of the Brands filter values, Noco. This is the first filter value; therefore, you will let the crawler fetch that page.
The URL for this selection does not contain the exclusion parameter:
mysite.com/accessories/motorcycle-battery-chargers?brand=noco

The crawler now checks one of the Style filter values, cables. Since this is the second filter value applied, you will still let the crawler access the URL.
The URL still does not contain the exclusion parameter. It contains just the brand and style parameters:
mysite.com/accessories/motorcycle-battery-chargers?brand=noco&style=cables

Now, the crawler “selects” one of the Pricing filter values, the number 1. Since this is the third filter value, you will append the crawler=no to the URL.
The URL becomes:
mysite.com/accessories/motorcycle-battery-chargers?brand=noco&style=cables&pricing=1&crawler=no

If you want to block the URL above, the robots.txt file will contain:

User-agent: *
Disallow: /*crawler=no

The method described above prevents the crawling of facet URLs when more than two filter values have been applied, but it does not allow specific control over which filters will be crawled and which ones will not. For example, if the crawler “checks” the Pricing options first, the URL containing the pricing parameter will be crawled. We will discuss faceted navigation in detail later on.

URL parameters handling

URL parameters can cause crawl efficiency problems and duplicate content issues. For example, if you implement sorting, filtering, and pagination with parameters, you will likely end up with many URLs, wasting the crawl budget. In a video about parameter handling, Google shows[22] how 158 products on googlestore.com generated an astonishing 380,000 URLs for crawlers.

Controlling URL parameters within Google Search Console and Bing Webmaster Tools can improve crawl efficiency, but it will not address the causes of duplicate content. You will still need to fix canonicalization issues at the source. However, since ecommerce websites use multiple URL parameters, controlling them correctly with webmaster tools may prove tricky and risky. Unless you know what you are doing, you are better off using either a conservative setup or the default settings.

URL parameters handling is mostly used to decide which pages to index and which to canonicalize.

One advantage of handling URL parameters within webmaster accounts is that page-level directives (i.e., rel=“canonical” or meta noindex) will still apply as long as the pages containing such directives are not blocked with robots.txt or other methods. However, while it is possible to use limited regular expressions within robots.txt to prevent the crawling of URLs with parameters, robots.txt will overrule page-level and element-level directives.

Figure 112 – A Google Search Console notification regarding URL parameters.

Sometimes, you do not have to play with the URL parameters settings. This screenshot shows a message saying that Google has no issues categorizing your URL parameters. You can leave the default settings if Google can easily crawl the entire website. To set up the parameters, click the Configure URL parameters link.

Figure 113 – This screenshot is for an ecommerce website with fewer than 1,000 SKUs. You can see how the left navigation generated millions of URLs.

In the previous screenshot, the limit key (used for changing the number of items listed on the category listing page) generated 6.6 million URLs when combined with other possible parameters. However, because this website has strong authority, it gets a lot of attention and love from Googlebot and does not have crawling or indexing issues.

When handling parameters, you first want to decide which ones change the content (active parameters) and which do not (passive parameters). It is best to do this with your programmers because they will know the best usage of parameters. Parameters that do not affect how content is displayed on a page (e.g., user tracking parameters) are a safe target for exclusion.

Although Google does a good job of identifying parameters that do not change content, it is still worthwhile to set them manually.

To change the settings for such parameters, click Edit:

Figure 114 – Controlling URL parameters within Google Search Console.

In our example, the parameter utm_campaign was used to track the performance of internal promotions, and it does not change the page’s content. In this scenario, choose “No: Does not affect page content (ex: track usage).”

Figure 115 – Urchin Tracking Module parameters (UTMs) can safely be consolidated to the representative URLs.

To ensure you are not blocking the wrong parameters, test sample URLs in a browser: load each URL, remove the tracking parameters, and see whether the content changes. If it does not, the parameter can be safely excluded.

On a side note, tracking internal promotions with UTM parameters is not ideal. UTM parameters are designed to track campaigns outside your website. If you want to track the performance of your internal marketing banners, then use other parameter names or event tracking.

Some other common exclusion parameters you may consider are session IDs, UTM tracking parameters (utm_source, utm_medium, utm_term, utm_content, and utm_campaign), and affiliate IDs.

A word of caution is necessary here, and this recommendation comes straight from Google:[23]

“Configuring site-wide parameters may have severe, unintended effects on how Google crawls and indexes your pages. For example, imagine an ecommerce website that uses storeID in both the store locator and to look up a product’s availability in a store:
/store-locator?storeID=123
/product/foo-widget?storeID=123
If you configure storeID to not be crawled, both the /store-locator and /foo-widget paths will be affected. As a result, Google may not be able to index both kind of URLs, nor show them in our search results. If these parameters are used for different purposes, we recommend using different parameter names”.

In the scenario above, you can keep the store location in a cookie instead.

Things get more complicated when parameters change how the content is displayed on a page.

One safe setup for content-changing parameters is to suggest to Google how the parameter affects the page (e.g., sorts, narrows/filters, specifies, translates, paginates, others) and use the default option Let Google decide. This approach will allow Google to crawl all the URLs that include the targeted parameter.

Figure 116 – A safe setup is to let Google know that a parameter changes the content and let Google decide what to do with the parameter.

In the previous example, I knew that the mid parameter changes the content on the page, so I pointed out to Google that the parameter sorts items. However, I let Google decide which URLs to crawl.

I recommend letting Google decide because of how Google chooses canonical URLs: it groups duplicate content URLs into clusters based on internal linking (PageRank), external link popularity, and content. Then, Google picks the best URL from each cluster to display in search results. Since Google does not share the complete link graph of your website, you will not know which URLs are linked to the most, so you may not always be able to choose the right URL to canonicalize to.

  1. Google Patent On Anchor Text And Different Crawling Rates, http://www.seobythesea.com/2007/12/google-patent-on-anchor-text-and-different-crawling-rates/
  2. Large-scale Incremental Processing Using Distributed Transactions and Notifications, http://research.google.com/pubs/pub36726.html
  3. Our new search index: Caffeine, http://googleblog.blogspot.ca/2010/06/our-new-search-index-caffeine.html
  4. Web crawler, http://en.wikipedia.org/wiki/Web_crawler#Politeness_policy
  5. To infinity and beyond? No!, http://googlewebmastercentral.blogspot.ca/2008/08/to-infinity-and-beyond-no.html
  6. Crawl Errors: The Next Generation, http://googlewebmastercentral.blogspot.ca/2012/03/crawl-errors-next-generation.html
  7. Make Data Useful, http://www.scribd.com/doc/4970486/Make-Data-Useful-by-Greg-Linden-Amazon-com
  8. Shopzilla’s Site Redo – You Get What You Measure, http://www.scribd.com/doc/16877317/Shopzilla-s-Site-Redo-You-Get-What-You-Measure
  9. Expires Headers for SEO: Why You Should Think Twice Before Using Them, http://moz.com/ugc/expires-headers-for-seo-why-you-should-think-twice-before-using-them
  10. How Website Speed Actually Impacts Search Ranking, http://moz.com/blog/how-website-speed-actually-impacts-search-ranking
  11. Optimizing your very large site for search — Part 2, http://web.archive.org/web/20140527160343/http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2009/01/27/optimizing-your-very-large-site-for-search-part-2.aspx
  12. Matt Cutts Interviewed by Eric Enge, http://www.stonetemple.com/articles/interview-matt-cutts-012510.shtml
  13. Save bandwidth costs: Dynamic pages can support If-Modified-Since too, http://sebastians-pamphlets.com/dynamic-pages-can-support-if-modified-since-too/
  14. Site Map Usability, http://www.nngroup.com/articles/site-map-usability/
  15. New Insights into Googlebot, http://moz.com/blog/googlebot-new-insights
  16. How Bing Uses CTR in Ranking, and more with Duane Forrester, http://www.stonetemple.com/search-algorithms-and-bing-webmaster-tools-with-duane-forrester/
  17. Multiple XML Sitemaps: Increased Indexation and Traffic, http://moz.com/blog/multiple-xml-sitemaps-increased-indexation-and-traffic
  18. How Bing Uses CTR in Ranking, and more with Duane Forrester, http://www.stonetemple.com/search-algorithms-and-bing-webmaster-tools-with-duane-forrester/
  19. How does Google treat +1 against robots.txt, meta noindex, or redirected URL, https://productforums.google.com/forum/#!msg/webmasters/ck15w-1UHSk/0jpaBsaEG3EJ
  20. Robots meta tag and X-Robots-Tag HTTP header specifications, https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
  21. Supporting rel=” canonical” HTTP Headers, http://googlewebmastercentral.blogspot.ca/2011/06/supporting-relcanonical-http-headers.html
  22. Configuring URL Parameters in Webmaster Tools, https://www.youtube.com/watch?v=DiEYcBZ36po&feature=youtu.be&t=1m50s
  23. URL parameters, https://support.google.com/webmasters/answer/1235687?hl=en