Domain Canonicalization

by Nathan Buggia

Pop quiz: what’s the difference between the following URLs:

  • http://website.com
  • http://www.website.com
  • http://website.com/default.php
  • http://www.website.com/default.php

Give up? If you’re a user, then chances are you expect all of those URLs to lead you to the same page. Robots, however, are not as good at determining if pages are the same, so they often store each separately. A big part of how search engines rank pages is based on how many external links those pages have. If other sites on the web link to the different versions of your home page, then search engines may calculate the value of each URL separately, based on the number of links to each version. This can effectively diminish the potential rank your page would have if it were found (and linked to) by only one URL.

The practice of consolidating all versions of a page under one URL is referred to as “canonicalization” (because you collapse all versions under the “canonical” or true version). The four examples listed above are the most common, but there are potentially many, many URLs that lead you to the same page. By adhering to several best practices, you should be able to address 90% of common site-wide canonicalization issues on your site and consequently increase how your site ranks.

Recommendation

The solution is to be explicit about the canonical form of your URLs. Following are four best practices to achieve this, with specific code and configuration examples.

  1. Select WWW or Non-WWW, then redirect the other option to your preferred version. The hard part is choosing if you want your site to be “www.website.com” or simply “website.com”. There is no right answer for every company so you’ll have to figure this out on your own (but removing the “www.” saves your customers 4 keystrokes, which really add up on a mobile device, and it makes your brand the first thing your customers see). Once you’ve selected, you then need to find a way to trap all requests to your application, check which form is being used, and if it is not the correct form, initiate a 301 Redirect to the correct form. For example, if the user types in wikipedia.org, they will automatically get redirected to www.wikipedia.org.
  2. Remove the default filename from the end of your URLs. All web servers allow you to select one or more default filenames to serve when the browser requests a directory. For example, this website is run on IIS, so when the user requests “http://janeandrobot.com” we really serve “http://janeandrobot.com/default.aspx”. In the same code you use to enforce www vs. non-www, you should also check and see if the default filename is at the end of the URL and then trim it off. So, “http://janeandrobot.com/default.aspx” would be converted to “http://janeandrobot.com/”.
  3. Link internally to the canonical form of your URL. Make sure you always link to the proper canonical form of your URLs from within your site. This practice helps encourage external sites to link to the site using the correct version as well (since those linking to you often cut and paste from your pages or RSS feed). Note there is a degree of diminishing returns here, so you don’t need to spend the whole weekend hunting down every last URL. Just make sure to review your site’s primary navigation, top landing pages and blog.
  4. Use Google Webmaster Tools to tell Google the correct form. Implementing these best practices on your site is ideal, since they address the problem for all search engines and give your customers a consistent, properly branded navigation experience. But what can you do if you reviewed steps 1-3 and found that it would take six months to implement them on your production site? There is something that you can do today: using Google’s Webmaster Tools, you can navigate to the “Tools” section and select “Set preferred domain.” Here you can specify if you’d like Google to use “www.website.com” or “website.com” in their index and search results, as well as consolidate links to both versions. Note that while this will provide you a short-term benefit from Google, it does not help you in Yahoo! or Live Search.
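
Practices 1 and 2 can be sketched in a few lines. The following is only an illustration in Python, assuming a non-www canonical host; the names CANONICAL_HOST, DEFAULT_FILENAMES, canonical_url and redirect_for are invented for this example and are not part of any web server or framework. In a real application this logic would sit in whatever hook sees every incoming request.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration only: the preferred host and the list of
# default filenames are hypothetical settings; adjust them for your site.
CANONICAL_HOST = "website.com"
DEFAULT_FILENAMES = ("default.aspx", "default.php", "index.html")


def canonical_url(url):
    """Return the canonical form of url: preferred host, no default filename."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Collapse the www and non-www hosts onto the preferred one.
    if host in (CANONICAL_HOST, "www." + CANONICAL_HOST):
        host = CANONICAL_HOST
    if parts.port:
        host = f"{host}:{parts.port}"
    # Trim a trailing default filename, keeping the directory slash.
    directory, _, filename = parts.path.rpartition("/")
    path = parts.path
    if filename.lower() in DEFAULT_FILENAMES:
        path = directory + "/"
    if not path:
        path = "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


def redirect_for(url):
    """Return (301, target) if url is not in canonical form, else None."""
    parts = urlsplit(url)
    # An empty path and "/" are the same request, so treat them as equal.
    as_requested = urlunsplit(parts._replace(path=parts.path or "/"))
    target = canonical_url(url)
    return None if target == as_requested else (301, target)
```

With these settings, redirect_for("http://www.website.com/default.php") asks for a 301 to "http://website.com/", while a request that is already canonical returns None and is served normally.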

Checking Your Website

To check your website to see if you’re handling domain canonicalization correctly, you can use the Live HTTP Headers add-on for Firefox. 

Open the Live HTTP Headers tool, then try all the variations of the URL at several different levels to ensure they all redirect back to the appropriate canonical form. As you’re checking each variation, look at the HTTP headers using the Firefox plug-in to ensure they are all 301 redirects (and not, for instance, 302 redirects).

Here’s an example test case:

Canonical URL Form                        Test Case                                 Test Result
http://janeandrobot.com                   janeandrobot.com                          Success
http://janeandrobot.com                   janeandrobot.com/default.aspx             Success
http://janeandrobot.com                   www.janeandrobot.com                      Success
http://janeandrobot.com                   www.janeandrobot.com/default.aspx         Success
http://janeandrobot.com/about.aspx        janeandrobot.com/about.aspx               Success
http://janeandrobot.com/about.aspx        www.janeandrobot.com/about.aspx           Success
http://janeandrobot.com/folder            janeandrobot.com/folder                   Success
http://janeandrobot.com/folder            janeandrobot.com/folder/default.aspx      Success
http://janeandrobot.com/folder            www.janeandrobot.com/folder               Success
http://janeandrobot.com/folder            www.janeandrobot.com/folder/default.aspx  Success
http://janeandrobot.com/folder/test.aspx  janeandrobot.com/folder/test.aspx         Success
http://janeandrobot.com/folder/test.aspx  www.janeandrobot.com/folder/test.aspx     Success
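
If you would rather script this check than click through each variation by hand, something like the following could work. It is a rough sketch using only Python’s standard library; fetch_redirect, check and run_test_case are names made up for this example. It assumes the server answers HEAD requests and sends an absolute URL in its Location header, neither of which is guaranteed, and it only covers the non-canonical variations (the canonical URL itself should simply return 200).

```python
import http.client
from urllib.parse import urlsplit


def fetch_redirect(url, timeout=10):
    """Request url without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        # Assumption: the server handles HEAD the same way it handles GET.
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def check(status, location, canonical):
    """Classify one response against the expected canonical URL."""
    # Ignore a trailing slash so "http://site.com" matches "http://site.com/".
    if status == 301 and (location or "").rstrip("/") == canonical.rstrip("/"):
        return "Success"
    if status in (302, 303, 307):
        return f"Fail: temporary redirect ({status})"
    if status == 301:
        return f"Fail: redirects to {location}"
    return f"Fail: no redirect ({status})"


def run_test_case(canonical, variations, fetch=fetch_redirect):
    """Return {variation: result} for every variation of one canonical URL."""
    return {v: check(*fetch(v), canonical) for v in variations}
```

Calling run_test_case("http://janeandrobot.com", ["http://www.janeandrobot.com", "http://janeandrobot.com/default.aspx"]) would fill in one group of rows from the table above. The fetch parameter exists so the classification logic can be exercised without touching the network.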

Examples

Canonicalization issues are very common, and being a Microsoft employee, I don’t have to go far to find an example. Check out the website for Microsoft’s annual Mix conference for web developers.

I was able to generate the table below by plugging the common URL variations into Yahoo’s Site Explorer to find a list of links to each variation. 

URL Variation                         Number of Links from within website  Number of Links from outside websites
http://visitmix.com                   17,663                               59,498
http://www.visitmix.com               9,074                                22,179
http://visitmix.com/default.aspx      0                                    22
http://www.visitmix.com/default.aspx  0                                    12

Looking through these numbers yields some interesting insights:

  • Not canonicalizing “www” vs. “non-www” is definitely hurting their ranking – you can tell because they have a similar number of inlinks for each version. Ranking is done on a logarithmic scale, so every additional link is more valuable than the one before. If they redirected all versions to one canonical form, search engines would see their home page as having 81,711 external links, which would be a substantial boost.
  • They are not good about using the same version of the URL within their site. If you’re not cognizant of this on your site, others won’t be either. Based on the internal link counts, they use visitmix.com about two-thirds of the time internally, and www.visitmix.com the other third.
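
The arithmetic behind these two observations is easy to reproduce. The snippet below is just a worked calculation in Python using the link counts from the table above; the variable names are arbitrary. It confirms the consolidated total of 81,711 external links and shows that the internal split is roughly two-thirds to one-third.

```python
# Link counts copied from the table above: (internal, external).
links = {
    "http://visitmix.com": (17663, 59498),
    "http://www.visitmix.com": (9074, 22179),
    "http://visitmix.com/default.aspx": (0, 22),
    "http://www.visitmix.com/default.aspx": (0, 12),
}

total_internal = sum(internal for internal, _ in links.values())
total_external = sum(external for _, external in links.values())
best_single = max(external for _, external in links.values())

# What a single canonical URL would be credited with.
print(f"Consolidated external links: {total_external:,}")
print(f"Gain over the strongest single URL: {total_external / best_single - 1:.0%}")

# How consistently the site links to itself.
for url, (internal, _) in links.items():
    print(f"{url}: {internal / total_internal:.0%} of internal links")
```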

Comments

VaBeachKevin May 9, 2008 at 7:12 am

This is one of the most overlooked items in my opinion. Great post.

Kittu May 21, 2008 at 8:48 pm

I’ve heard that using this on-page redirection may be considered a 302 redirection in the eyes of crawlers, because the crawler first goes to that page and reads the code, and only then gets the instruction to move to the target page.
As far as I know, the safest way is to move to a Linux server running Apache, which stores a file named .htaccess. You can give redirection instructions within that file, because whenever a request is generated the crawler first reads the .htaccess file, which tells the crawler which page is to be shown for the requested one, and thus it is a complete 301 redirection. :)

Vanessa Fox May 22, 2008 at 8:13 am

Hi Kittu – You’re absolutely right that a 301 is the way to go. There are multiple ways of implementing a 301 (including using .htaccess if your server is Apache, as you’ve described). We’ll be posting follow-up articles about implementation techniques.

As for what you mention in your first paragraph, when you use an on page meta refresh, crawlers may interpret that differently than you expect. We’ll be diving into those details in our implementation article as well.

Ashley Berman Hale June 10, 2008 at 3:23 am

Hm – interestingly enough, you have a link or two pointing to janeandrobot.com/default.aspx that does not 301 to janeandrobot.com.

I just thought I would give you a heads up about that. It looks like you covered it in the test case, so it might be a server/load balancing issue. I checked your header on that page and it’s still showing as 200.

/beep.

Sarah June 11, 2008 at 11:48 pm

Not only is this particular article/tutorial brilliant, but so far the entire Jane + Robot site says it all exactly as it always should have been said – and all in one place. Things I try to tell my clients every day, with varying degrees of success.

Even better, the site isn’t only just articles, it’s an authoritative resource that cites other documentation. THANK YOU.

Nathan Buggia June 12, 2008 at 4:35 am

@Ashley – good catch. As many of you know, implementing proper canonicalization can be a lot more difficult than just writing down the best practices :)

We’re still working on fixing the canonicalization of this site. We are currently tracking down a bug in our content management system; hopefully it will be fixed soon!

g1smd June 13, 2008 at 12:55 am

“removing the "www." saves your customers 4 keystrokes”

If you have the site-wide 301 redirect from non-www to www in place, then the visitor can omit typing the www in, and your redirect will deliver them to the correct URL and to the correct content anyway.

There are good reasons to use the www version as the canonical form, not least the ability to do:

site:domain.com -inurl:www

to make sure that no other forms, other than www that is, have been indexed.

You can’t do that if your redirect runs the other way.

Josh June 26, 2008 at 8:10 am

The reason we advise clients to always prefer www to non-www is that it makes people notice the URL in print advertising, signage, and other media. "www." is a very powerful visual cue to the presence of a URL. Having the brand "stand out" through the lack of www is only important in those cases where the URL is shown without other material, which is rare and inadvisable.

Nathan Buggia July 1, 2008 at 6:47 pm

@Josh and @G1SMD – good points, you’ve come real close to selling me on "www"

Randy Cooper July 7, 2008 at 5:53 am

I’ll go with Josh on the www

I’m curious now though about the use of subdomains. I’ve heard from both camps (1. builds pagerank on the primary domain) and (2. considered totally separate)

RKF July 7, 2008 at 11:42 am

I’m personally a fan of the non-www addresses. For most clients I’ll use the www because they often assume it and print it on their marketing material. For me … it’s unnecessary. I think people cue off the .com more than the www, and having the www in print can make it more difficult visually for a client to remember the domain name (especially on a vehicle or billboard). The most important thing for them to remember is the domain name (because you DO have your .com registered, right?) when you have your redirects in place.

g1smd July 12, 2008 at 10:53 am

Even if your site does use the www as the "real" address, you can still advertise the site without the www in both print media and broadcasting channels, and let the redirect fix up the URL after the user types it in.

For example, when I want to do a search at Google, I type google.com in to the browser, nothing more. I don’t bother with the www, as Google’s own redirect automatically adds it on for me and then lets me search.

Larry Swanson July 23, 2008 at 6:00 pm

I think the www vs. non-www decision should also consider your audience. If you’re trying to reach web-savvy techies, then by all means omit the www, but if you’re trying to reach less tech-savvy "civilians" and/or doing a lot of offline promotion, then keep the www (for the reasons RKF mentions above). In either case, it is important to be consistent in your usage across all media (you might call this "canonical branding") since you never know where someone will be when they jot down your URL and link to you.

payday loans August 27, 2008 at 10:15 pm

This information regarding canonicalization of the URL will definitely help. But how do I use the Live HTTP Headers tool to find out the exact status of a site regarding canonical issues?

breeders cup tips October 23, 2008 at 7:28 pm

Good idea posting that article. One day I asked myself this question, but out of laziness I never researched the answer. Now it caught my attention when I read this article. Wow, now I know. Thanks.

Gustavo November 11, 2008 at 2:47 am

Hi Nathan,

Great post!

Would you mind sharing the commands you used with Yahoo Site Explorer (or what did you select on the drop-downs) to generate the "Number of Links from within website" as per your table above. Maybe just one example?

Thanks,
Gustavo

freerolls November 14, 2008 at 5:14 am

Great post. I thought all 4 URLs were the same, but you have created doubt in my mind with these major differences :D
Appreciated, mate.

Tom Funk November 19, 2008 at 8:23 am

Great, great post, Nathan! Thank you! I follow most of it and agree with it, but I have a question.

I hear you when you say (item 2) "remove your default filename from the end of your URLs."

So you would definitely not 301 http://www.mydomain.com to http://www.mydomain.com/default.aspx. You would instead let http://www.mydomain.com return a "200 OK" header and display the default page via IIS.

But what, if anything, do you do to the http://www.mydomain.com/default.aspx page? Is it appropriate to 301 it to the canonical http://www.mydomain.com?

I worry that even if I myself don’t include the default page in my URLs, external sites might somehow directly link to the default page, it could get spidered, etc. and come to have a page rank of its own.

Tom Funk November 19, 2008 at 8:24 am

Oops, sorry, those links rendered weirdly, I hope you can make sense of what i was babbling :)

Nisha Singh November 23, 2008 at 5:26 pm

I think it is a great article on domain canonicalization. I have a site on .asp. The site’s index.asp comes up as the home page. Can I redirect it with a 301 as well? One of my friends told me not to redirect it because it is the home page. Please advise.

Thanks
http://mobilephonesandtechnologies.blogspot.com/

Nathan Buggia November 27, 2008 at 8:36 am

@Gustavo – for the site explorer tool, these are the options I select: (a) Inlinks (b) Show Inlinks: Except from this domain (c) to: Only this URL.

What this does is to remove all the links pointing to this page from your website so you can see the effect of external links (which is far more important in ranking to search engines). Here’s a link to the site explorer tool with the aforementioned options enabled:

http://siteexplorer.search.yahoo.com/search?p=http%3A%2F%2Fvisitmix.com&bwm=i&bwmo=d&bwmf=u

@Tom Funk – hey Tom, it is completely okay if folks still link to mydomain.com/default.aspx. If you’re using the redirect I recommend above, when a search engine encounters this URL, your website will 301 redirect it back to the canonical version, mydomain.com, and it will never store the version of the URL with the default filename.

@Nisha Singh – the redirecting should still work in your case as well. I recommend trying it out and then going through all of the test cases I listed above. If they all work, you should be good.

Honda CBR January 6, 2009 at 11:39 pm

I like to have my site without the WWW. I feel it makes it stand out more in the search engine results! I also write all of my heads in CAPS. Not sure if all this helps, but I feel that it does.

Mike Andrew November 12, 2009 at 11:24 pm

Great post Nathan, I’ve only just discovered your site and have bookmarked it now. This article is very simple to understand and I for one will be checking my sites tonight. I also have not specified with Google webmaster tools a preferred domain, but I will now.

Thanks again

Andrea Moro January 25, 2010 at 7:55 am

Hi, just to inform you that all the image URLs are broken and the images are not visible at all.
