URL Referrer Tracking

by Nathan Buggia

There may be instances when you want to track the source of a request, and a common way of doing so is by using tracking parameters in URLs. Unfortunately, implementing referrer tracking this way can cause significant issues with search engines. In particular, it can cause duplicate content issues (since the search engine bot finds multiple valid URLs that point to the same page) and ranking issues (since the links to the page don’t all point to the same URL).

Let’s say that Jane and Robot uploaded two different online training seminars to YouTube as part of a viral marketing effort to drive more traffic to our site. To gauge our return on investment from each of these seminars, we’ve added a tracking parameter to the link within each YouTube description that a customer can click on to learn more. Here are the two URLs: http://janeandrobot.com/?from=promo-seminar-1 and http://janeandrobot.com/?from=promo-seminar-2. Each brings the customer to our home page (the same page served by http://janeandrobot.com), and we track the conversions based on the from parameter in the URL.

While this solution may seem to work well initially, it can result in low quality tracking data and impact our search acquisition. Here’s a summary of the major problems:

  1. Duplicate content – search engines sometimes have difficulty determining if two URLs contain the exact same page (see canonicalization for more information). In this case, we’re creating this problem ourselves because we’ve created multiple URLs for the same page. Search engines are likely to find all three URLs for the home page and store and rank them as separate content within their index. This could cause the search engine robots to crawl the page three times instead of just once (which may not be a big deal if we are only tracking two promotions, but could become a big problem if we used similar tracking parameters for many other campaigns and URLs). Not only are the robots using more bandwidth than is necessary, but since they don’t crawl a site infinitely, they could spend all the allotted time crawling duplicate pages and never get to some of the good unique pages on the site.
  2. Ranking – search engines use the number of quality links pointing to a URL as a major signal in determining the authority and usefulness of that content. Because we now have three different URLs pointing to the same page, people have three choices when linking to it. The result is a lower rank for all of the variations of the URL. Search engines generally filter out duplicates, so for instance, if the original (canonical) home page has 100 incoming links and each URL with a tracking parameter has 25 links, then search engines might filter out the two URLs with fewer links and show only the canonical URL, ranking it at position eight for a particular query based on those 100 incoming links. If all incoming links were to the same URL, then search engines would count 150 links to the home page and might rank it at position three for that same query. Another danger is that if one of the YouTube promo videos becomes exceptionally popular, its promo URL might gain more links than the original home page URL. Using this same example, if one of the promo URLs gained 200 links, search engines might choose to display it in the search results over the original home page. This could cause a confusing experience for potential customers who are looking for your home page (http://janeandrobot.com/?from=promo-seminar-1 doesn’t look like a home page and searchers might be less likely to click on it, thinking it’s not the page they’re looking for). It’s also not ideal from a branding perspective.
  3. Reporting quality – as social networking sites become more popular, we become more of a sharing culture online. Many people use browser bookmarks, email, online bookmarking sites such as Delicious, and other sharing sites such as Facebook, Twitter, and FriendFeed to save and share URLs. They’ll click on a URL, and if they like it, copy and paste it from the browser’s address bar. If the link they’re saving or sharing happens to be one of our promotional links, then they have preserved this link for all time, and everyone who clicks through the link will look identical to someone coming through the promo. This skews the reporting numbers of who went to the site after viewing the video, which was why we set up the tracking parameters in the first place!
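
To make the first two problems concrete, here is a minimal JavaScript sketch. The helper names (canonicalize, linksPerCanonicalUrl) are made up for illustration, and the link counts are the hypothetical numbers from the ranking example above, not real data. It assumes the tracking parameter is named from.

```javascript
// Illustrative sketch only: helper names and link counts are assumptions
// based on the hypothetical ranking example, not real measurements.

// Return the URL with the tracking parameter removed.
function canonicalize(url, trackingParam) {
  const parsed = new URL(url);
  parsed.searchParams.delete(trackingParam);
  return parsed.toString();
}

// Sum inbound links per canonical URL.
function linksPerCanonicalUrl(linkCounts, trackingParam) {
  const totals = {};
  for (const [url, count] of Object.entries(linkCounts)) {
    const key = canonicalize(url, trackingParam);
    totals[key] = (totals[key] || 0) + count;
  }
  return totals;
}

const inboundLinks = {
  "http://janeandrobot.com/": 100,
  "http://janeandrobot.com/?from=promo-seminar-1": 25,
  "http://janeandrobot.com/?from=promo-seminar-2": 25,
};

// Three URLs in the index, but only one page once the parameter is dropped.
const totals = linksPerCanonicalUrl(inboundLinks, "from");
console.log(totals);
```

With the tracking parameter stripped, the three entries collapse into a single home page URL carrying all 150 links, which is the outcome the options below try to achieve.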

Implementation Options

Unfortunately there is no perfect solution for this scenario, and what works best for you depends on your infrastructure and situation. Here we’ve listed several common solutions that you can choose from to improve your own implementation. We generally recommend the first solution (Redirects), but there are pros and cons to each option that you should review carefully before making your decision.

Redirects (and Cookies)

The first option strives to solve the problem by trapping all of the promotional requests, recording the tracking information, then removing the tracking parameter from the URL. This can be time consuming to implement, but it is the best all-round option for addressing the three major issues listed above.

If you want to get fancy and track a user’s entire session based on your referral parameter, you can use this method as well: simply set a cookie on the client machine at the same time you trap the request. This is recommended if you want to understand the value of traffic from different sources. In either case, here are the steps you’ll need to undertake:

1. Trap the incoming request – find where your web site application’s logic processes the HTTP request for your page. Trap each request at that point and check if it has a tracking parameter. If it does, record this in your internal referral tracking system. You can record this either in your server logs, or in a custom referral tracking database you maintain on your own.

  • If you also would like to track the entire user’s session, then you should also use this opportunity to set a cookie on the client.

2. Implement the redirect – the next step is to implement a 301 redirect from the current URL to the same page without the tracking parameter (that is, the canonical URL). Don’t forget to set the Cache-Control header in the HTTP response to ensure that all the requests come to your server and don’t get handled automatically by some network-based cache. Here’s what a sample redirect header might look like:

HTTP/1.1 301 Moved Permanently
Location: http://janeandrobot.com/
Cache-Control: max-age=0

Note that ASP.NET and IIS both use 302 redirects by default, so you may need to set the 301 response code manually.

The way this works is that when a search engine encounters a promotional URL (http://janeandrobot.com/?from=promo-seminar-1) it issues an HTTP GET request to the URL. The HTTP response tells the search engine that this page has been permanently moved (301 Redirect) and provides the new address (the same as the old address but without the tracking parameter). The search engine then discards the first URL (with the tracking code) and only stores the second URL (without the tracking code). And everything is right in the world.
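
The two steps above can be sketched as a single request handler. This is a minimal JavaScript sketch, not a drop-in implementation: the function name handleTrackedRequest, the recordReferral callback, and the cookie name referral_source are all hypothetical, and a real site would wire this into its own framework and tracking store.

```javascript
// Minimal sketch of the trap-and-redirect flow. The function name, the
// recordReferral callback, and the cookie name "referral_source" are
// hypothetical; adapt them to your own framework and tracking store.
function handleTrackedRequest(requestUrl, recordReferral) {
  const url = new URL(requestUrl);
  const source = url.searchParams.get("from");

  // No tracking parameter: nothing to trap, serve the page normally.
  if (source === null) {
    return null;
  }

  // Step 1: record the referral before the parameter is dropped.
  recordReferral(source, requestUrl);

  // Step 2: redirect to the same URL without the tracking parameter.
  url.searchParams.delete("from");
  return {
    status: 301,
    headers: {
      "Location": url.toString(),
      "Cache-Control": "max-age=0",
      // Optional: remember the source for the rest of the session.
      "Set-Cookie": "referral_source=" + encodeURIComponent(source) + "; Path=/",
    },
  };
}

const recorded = [];
const response = handleTrackedRequest(
  "http://janeandrobot.com/?from=promo-seminar-1",
  (source) => recorded.push(source)
);
console.log(response.status, response.headers["Location"]);
```

Because the redirect is issued for every requester, search engine or not, this is not the conditional redirect described under Common Pitfalls.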

This implementation is one of the best options, but it does have some limitations:

  • This method requires you to manage your own referral tracking system. Because it traps the referral parameters and removes them from the URL before the page actually loads, third-party referral tracking applications like Google Analytics, Omniture, WebTrends or Microsoft adCenter Analytics will not be able to track these referrals.

Canonical URL <Link /> Tag

Possibly the simplest option to solve this issue is to take advantage of a new standard recently adopted by Google, Yahoo and Microsoft Live Search. Their solution to this problem is a new attribute value for the <link /> tag that explicitly tells them what the canonical URL for the page is. Assuming the <link /> tag has been created correctly, the search engines will treat it like a 301 redirect to the canonical URL.

Here’s an example of using this tag:

<html>
   <head>
      <link rel="canonical" href="http://janeandrobot.com" />
   </head>
</html>

Here are a few notes about implementing this tag:

  • Search engines view this as a hint, not a command. Implementing this tag isn’t a guarantee, although Google said they will try their best to make it work. The reason they can’t give any guarantees is because they may detect that you are implementing it incorrectly, or it is being used for some type of spammy scenario.
  • Relative or absolute URLs are supported within the href attribute. However, I recommend that you use absolute URLs whenever possible. This helps the search engines further normalize the URLs because they see what protocol (http or https) you use, and whether or not you are prefixing your domain with “www.”.
  • Sub-domains are supported, separate domains are not. With this tag you can specify a different sub-domain; for example, within this URL (http://janeandrobot.com?from=promo-seminar-2) you could specify this canonical URL (http://videos.janeandrobot.com). However, the <link /> tag would not be valid if you specify a completely different domain, like http://janeandrobot-videos.com.
  • Common Pitfalls… You’ll want to ensure that you don’t do anything silly like (i) create an infinite loop with two canonical tags pointing to each other (ii) have the canonical tag point to a page that returns a 404 status code. You should also make sure that your canonical URL is generally a short and simple URL.

While this implementation seems a little too good to be true, there are a few potential downsides. The first is that if you implement it incorrectly, the search engines will simply ignore it, and that could be complicated to debug. The other issue is that it fixes issues #1 (duplicate content) and #2 (ranking) but does nothing to fix the 3rd issue of reporting. Still, given all of that I would likely implement this option first and do the others when I had some spare dev cycles.
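
If your pages are generated by code, the tag can be built from the requested URL. The following JavaScript sketch shows one way to do that; canonicalLinkTag and the list of tracking parameter names are illustrative assumptions, not part of any search engine’s API.

```javascript
// Sketch: build the canonical <link /> tag for a requested URL by dropping
// known tracking parameters. canonicalLinkTag and the parameter list are
// illustrative names, not part of any search engine's API.
function canonicalLinkTag(requestUrl, trackingParams) {
  const url = new URL(requestUrl);
  for (const name of trackingParams) {
    url.searchParams.delete(name);
  }
  // Fragments never belong in a canonical URL.
  url.hash = "";
  // Escape characters that are unsafe inside an HTML attribute value.
  const href = url.toString().replace(/&/g, "&amp;").replace(/"/g, "&quot;");
  return '<link rel="canonical" href="' + href + '" />';
}

console.log(canonicalLinkTag("http://janeandrobot.com/?from=promo-seminar-2", ["from"]));
```

The sketch always emits an absolute URL, in line with the recommendation above, and leaves any non-tracking query parameters in place.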

URL Fragment

A simple and elegant option is to place the tracking parameter behind a hash mark in the URL, creating a URL fragment. Traditionally, fragments are used to denote links within a page, and search engines ignore them completely. In fact, they simply truncate the fragment from the URL.

Old URL

  • http://janeandrobot.com/?from=promo-seminar-1
  • http://janeandrobot.com/?from=promo-seminar-2

New URL with URL Fragment

  • http://janeandrobot.com/#from=promo-seminar-1
  • http://janeandrobot.com/#from=promo-seminar-2

By default Google Analytics will ignore the fragment as well. However, there is a simple workaround, provided to us by Avinash Kaushik, Google’s web metrics evangelist, that uses the following JavaScript:

var pageTracker = _gat._getTracker("UA-12345-1");

// Drop the leading "#": per Caleb Whitmore's comment below, GA ignores
// anything after a "#" in the tracked page path.
var fragment = document.location.hash.replace(/^#/, "");

// Use ONE of the two calls below; each call records a pageview.

// Solution for domain level only
pageTracker._trackPageview(document.location.pathname + "/" + fragment);

// If the URL also includes a query string
pageTracker._trackPageview(document.location.pathname + document.location.search +
                           "/" + fragment);

You can create a few additional variations of this if you also have additional queries in the URL you would like to track. Check with your web analytics provider to find out if you need to customize your implementation to account for using URL fragments for tracking.

Does this sound too simple and easy to be true? There are a couple downsides to this approach:

  • This option fixes issues 1 (duplicate content) & 2 (ranking) listed above, but it will not address the 3rd issue of reporting. You could still encounter some reporting issues using this method if people are bookmarking or emailing around the URL.
  • Typically you’ll have to write some custom code to parse the URL fragment. Since it’s a non-standard implementation, standard methods may not support this.
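
The custom parsing code mentioned above can be quite small. Here is a minimal JavaScript sketch, assuming the fragment uses the same name=value format as a query string; parseFragmentParams is a hypothetical helper name.

```javascript
// Sketch: read tracking parameters from a URL fragment such as
// "#from=promo-seminar-1". parseFragmentParams is a hypothetical helper.
function parseFragmentParams(hash) {
  // Drop the leading "#" and parse the rest like a query string.
  const params = new URLSearchParams(hash.replace(/^#/, ""));
  const result = {};
  for (const [name, value] of params) {
    result[name] = value;
  }
  return result;
}

// In a browser you would pass window.location.hash instead.
const fragment = new URL("http://janeandrobot.com/#from=promo-seminar-1").hash;
console.log(parseFragmentParams(fragment).from);
```

Keep in mind that the fragment is never sent to the server, so this parsing has to run in the browser and report the value back through your analytics code.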

Robots Exclusion Protocol

Another relatively simple solution is to use robots.txt to ensure that search engines are not indexing URLs that contain tracking parameters. This method enables you to ensure that the original (canonical) version of the URL is always the one indexed and avoids the duplicate content issues involving indexing and bandwidth.

Assuming that all of our tracking parameters will follow a similar pattern to this:

http://janeandrobot.com/?from=<PromoID>

we can easily create a pattern that will match for this. Below is a robots.txt file that implements the pattern:

# Sample Robots.txt file, single query parameter
User-agent: *
Disallow: /?from=

The User-agent line means that this rule applies to all search engines (or robots crawling your site), and the Disallow line tells them not to crawl any URL that starts with ‘janeandrobot.com/?from=’ followed by a promotional code of any length. See complete information on using the Robots Exclusion Protocol. Use this pattern if you will have multiple query parameters:

# Sample Robots.txt file, multiple query parameters
User-agent: *
Disallow: /*from=

Once you’ve implemented the pattern appropriate for your site, you can easily check to see if it is working correctly by using the Google Webmaster Tools robots.txt analysis tool. It enables you to test specific URLs against a test robots.txt file. Note that although this tool tests GoogleBot specifically, all the major search engines support the same pattern matching rules. In Google Webmaster Tools:

  1. Add the site, then click Tools > Analyze robots.txt. (Unlike most features in Google Webmaster Tools, you don’t need to verify ownership of the site to use the robots.txt analysis tool). The tool displays the current robots.txt file.
  2. Modify this file with the Disallow line for the tracking parameter. (If the site doesn’t yet have a robots.txt file, you’ll need to copy in both the User-agent and Disallow lines.)
  3. In the Test URLs box, add a couple of the URLs you want to block. Also add a few URLs you do want indexed (such as the original version of the URL that you’re adding tracking parameters to).
  4. Click Check. The tool displays how Googlebot would interpret the robots.txt file and if each URL you are testing would be blocked or allowed.
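
If you want a quick local sanity check before using the tool, the two patterns above can be approximated in a few lines of JavaScript. This is a simplified sketch under stated assumptions: the helper names are made up, a rule is matched from the start of the path and query, “*” stands for any run of characters, and Allow rules, “$” anchors and per-robot groups are not handled. It is not a substitute for testing against the search engines’ own tools.

```javascript
// Rough local approximation of wildcard Disallow matching: a rule matches
// from the start of the path, and "*" stands for any run of characters.
// Simplified sketch (hypothetical helper names, no Allow rules, no "$"
// anchors, no per-robot groups), not a full robots.txt parser.
function disallowPatternToRegExp(pattern) {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped);
}

function isBlocked(url, disallowPatterns) {
  const parsed = new URL(url);
  const pathAndQuery = parsed.pathname + parsed.search;
  // An empty Disallow value blocks nothing, so skip it.
  return disallowPatterns.some(
    (p) => p !== "" && disallowPatternToRegExp(p).test(pathAndQuery)
  );
}

const singleParamRule = ["/?from="];
const multiParamRule = ["/*from="];

console.log(isBlocked("http://janeandrobot.com/?from=promo-seminar-1", singleParamRule));
console.log(isBlocked("http://janeandrobot.com/seminars?page=2&from=promo", singleParamRule));
console.log(isBlocked("http://janeandrobot.com/seminars?page=2&from=promo", multiParamRule));
console.log(isBlocked("http://janeandrobot.com/", multiParamRule));
```

The second and third calls show the practical difference between the two sample files: the single-parameter rule only covers the home page, while the wildcard rule also catches a from parameter that appears later in a longer URL.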

At this point you may be thinking, wow, I can do all this and not have to write any new code? Unfortunately, there are even more downsides to this approach than the others:

  • This option will fix issue 1 (duplicate content), but not issues 2 (ranking) and 3 (reporting). This can be a good interim solution while you’re implementing the more complete redirects solution, but it often isn’t useful enough on its own.
  • Likely this will take a little bit of extra testing to ensure you get the patterns correct in your robots.txt file and don’t inadvertently block content you want indexed.

Yahoo Site Explorer

Yahoo provides an online tool designed to solve this scenario. However, the solution only helps with Yahoo search traffic. To use the Yahoo fix, simply go to http://siteexplorer.search.yahoo.com and create an account for your web site in the Yahoo Site Explorer tool. Once you’ve verified ownership of your web site, you can use their Dynamic URL Rewriting tool to indicate which parameters in your URLs Yahoo should ignore.

Yahoo URL Rewriting Tool

Simply specify the name of the parameter you use for referral tracking (in our example it is ‘from’), and set the action ‘Remove from URLs’. Yahoo will then remove that parameter from all of your URLs while processing them and give you a handy little report about how many URLs were impacted.

This is another solution that seems too easy to be true, and again there are some significant limitations with this approach:

  • At the end of the day this is still a Yahoo-only solution. With approximately 20% market share, it is likely this will not meet all of your needs. However, if you do get some percentage of your traffic from Yahoo, there is no harm in doing this in the short term while you implement another method in the longer term.
  • The other problem with this solution is that it doesn’t solve issue #3 (reporting), so you are still susceptible to reporting errors due to folks bookmarking and emailing your URLs with tracking codes.

Common Pitfalls

Cloaking & Conditional Redirects

Some web sites and SEO consultants attempt to solve this with a technique called cloaking or conditional redirects. Essentially, these methods check if the HTTP GET request is coming from a search engine and then show it something different from what normal users see. This something different could be a simple 301 redirect back to the page without the tracking parameter, similar to our first solution above. The difference is that our solution implemented this redirect for all requesters, while cloaking and conditional redirects implement it only for search engines.

The big problem with this implementation method is that cloaking and conditional redirects are explicitly prohibited in the webmaster guidelines for Google, Yahoo and Live Search. If you use this method, you risk your pages being penalized or banned by the search engines. The primary reason they prohibit these behaviors is that they want to know exactly what content they are presenting to searchers using their service. When a web site shows something different to a search engine robot than to a general user, a search engine can never be sure what the user will see when they go to the web site. So, even if you’re thinking of implementing cloaking for what seems to be a valid, and not deceptive, reason, it’s still a technique search engines strongly discourage.

This leads to the second major problem with this implementation method – it adds significant complication, and it can be difficult to monitor whether or not it’s working (for example, you have to test it while pretending to be each of the three search engines’ robots). When things go wrong, it is likely that you’re not going to see it right away, and by the time you do, your search engine traffic may already be impacted. Check out this example when Nike ran into an issue with cloaking.

Crazy Tracking Codes

Many studies on the web show that customers prefer short, understandable URLs over long, complicated ones, and are more likely to click on them in the search results. In addition, users prefer descriptive keywords in URLs. Therefore, it might be worth spending a few extra minutes thinking about the tracking codes you use to see if you can make them friendlier.

Good examples

  • ?from=promo
  • ?from=developer-video
  • ?partner=a768sdf129

Bad examples

  • ?i=A768SDF129,re23ADFA,style-23423,date-2008-02-01&page=2
  • ?IAmSpyingOnYou=a768sdf129&YouAreASucker=re23adfd

Testing Your Implementation

So you’ve implemented your new favorite method, it compiles on your dev box, and now it’s time to roll it into production, right? Maybe not! The initial goal of referrer URL-based tracking was to understand where your traffic was coming from so you can use that information to optimize your business. To ensure the data you’re collecting is actually useful, we highly recommend that you do some testing to ensure that all the common scenarios are working the way you expect, and that you know where the holes are in your measurement capabilities. As with all metrics on the web, there will be holes in your data, so you need to know what they are and account for them.

The first step in testing the implementation is to try it with a test parameter, walking the full scenario through start to finish.

  1. Create several phoney promotional links that reflect the actual types of links you expect. This could be on your home page, product pages or with many additional query parameters that you might encounter.
  2. Place these fake promotional links in a location that won’t confuse your customers but is likely to get indexed by search engines. A social networking site or a blog might serve this well.
  3. Click through those links as a customer and verify that you get to the correct page with a good user experience. Be sure to take these into account as well:
    • Redirects operating properly (if you’re using them) – use the Live HTTP Headers tool in Firefox to ensure the application is providing the correct headers (301 redirect and caching).
    • Major browsers all work – if you’re using cookies, you should test all the major browsers to ensure that they support cookies and that your scenario works the way you expect. Don’t forget to try common mobile browsers if your customers access your site this way.
  4. Check out the search engine experience to ensure that you’re not running into the duplicate content or ranking issues.
    • Major Engines submit URL – if you place the test URLs in the right social network or place on your blog, they should get indexed within a week or so. If they don’t you can also try the “submit a URL” from Google, Yahoo and Microsoft, though they are not guaranteed to work. Essentially you want to make sure the search engines have had the opportunity to see these URLs.
    • Use the ‘site:’ command to ensure tracking URLs are not indexed – here’s an example query in Google, Yahoo, and Microsoft showing that our Jane and Robot example promotional URLs are not indexed.
  5. Take a look at your metrics and ensure the numbers you’re recording correlate to the testing you are doing. Some additional things to consider:
    • Internal referrals - you might also want to add some logic to your application to filter out (or exclude) all referrals from the development team and your own employees. This is often done by checking requests against a list of known employee or company IP addresses and scrubbing those from your tracking data.
    • Caching Issues - you might also want to try out several scenarios with multiple subsequent requests. You’ll want to ensure that every request is going to your server and not getting cached somewhere along the way.
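
A couple of these checks are easy to automate. The JavaScript sketch below shows one possible shape for them; every name in it is hypothetical, the sample IP addresses come from ranges reserved for documentation, and it assumes you have already captured the response status and headers (with lower-case header names) using whatever HTTP client you prefer.

```javascript
// Sketch of two checks from the list above. All names are hypothetical and
// the sample IP addresses come from ranges reserved for documentation.

// Collect problems with a redirect response captured during testing.
// Assumes header names have been lower-cased by the HTTP client.
function redirectProblems(requestUrl, response, trackingParam) {
  const problems = [];
  if (response.status !== 301) {
    problems.push("expected status 301, got " + response.status);
  }
  const location = response.headers["location"];
  if (!location) {
    problems.push("missing Location header");
  } else if (new URL(location, requestUrl).searchParams.has(trackingParam)) {
    problems.push("Location still contains the tracking parameter");
  }
  if (!response.headers["cache-control"]) {
    problems.push("missing Cache-Control header");
  }
  return problems;
}

// Drop hits that came from known employee or company IP addresses.
function excludeInternalHits(hits, internalIps) {
  const internal = new Set(internalIps);
  return hits.filter((hit) => !internal.has(hit.ip));
}

const goodResponse = {
  status: 301,
  headers: { "location": "http://janeandrobot.com/", "cache-control": "max-age=0" },
};
console.log(redirectProblems("http://janeandrobot.com/?from=test-promo", goodResponse, "from"));

const hits = [
  { ip: "192.0.2.10", source: "test-promo" },   // pretend this is the office
  { ip: "203.0.113.7", source: "test-promo" },
];
console.log(excludeInternalHits(hits, ["192.0.2.10"]));
```

An empty list from redirectProblems means the redirect looks right for that one URL; it says nothing about how a search engine will eventually treat it, so the ‘site:’ checks above are still needed.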

{ 18 comments }

Ophir Cohen December 7, 2008 at 6:42 pm

Hey Nathan,
This is a very strong post! I personally like best the redirection method, however I’ve heard lately that Google can actually tell which are tracking parameters sometimes, I guess especially when you use the GA tracking syntax such as utm_… etc.

Dave Dugdale December 8, 2008 at 7:20 am

Nathan, good post. I must say I was impressed at Microsoft’s showing at PubCon last month. Everyone I spoke to at your booth was very knowledgeable. Which was quite the opposite at the Google booth: I had several questions on Google products like Google Ad Manager that no one could answer. Very disappointing.

With the absence of Yahoo at PubCon, could we be seeing MS starting to really make a strong effort at becoming the #2 search engine. It just seems like with the brain drain over at Yahoo and having two guys like Nathan and Jeremiah Andrick, MS might have a case at this.

Jonathan Hochman December 8, 2008 at 10:04 pm

*Sigh*

Why are we jumping through hoops for the search engines? There should be a meta tag where we can list URL parameters that have no impact on content. All such parameters should be ignored by the search engines. Is this really so hard?

Yeah, this could be gamed, but so can a lot of other web development methods. If the search engines are worried about abuse, they can run tests to see if the pages generated with the ignored URL parameters are equivalent, or not.

Jaan Kanellis December 9, 2008 at 2:37 pm

I would love to see the same type of detailed post for out-bound (internal and external) link tracking using Google Analytics and other tracking systems

Nick Gerner December 10, 2008 at 10:47 am

I see these issues all the time from big players. It’s just astounding to see how frequently this happens.

I agree with the spirit of Jonathan’s comment above about jumping through hoops for the search engines. I once heard that a "useful test is to ask… would I do this if search engines didn’t exist?" And it probably is in many cases.

But I think Nathan points out a great reference regarding the usability issues these tracking parameters introduce. It hurts your users too! And the reality is, we’re all getting traffic from the search engines. Even my pitiful blog traffic is mostly search engine based.

Given the small investment involved (take your pick of Nate’s several choices), this seems a small tax to pay to improve usability, click-throughs, and searchability.

Nice work Nate ;)

David - About Results Marketing December 10, 2008 at 7:21 pm

Great Post, I’m a real fan of Avinash, and have been killing myself on how to implement tracking on my websites. My clients don’t really do much with social media (as per your specific example), and it’s enough for me to know which site the traffic is coming from, which shows up anyways.

What I really wanted to implement was a source to track which navigation links users are clicking on. However, as mentioned, this provides some issues. The 301 redirect should be enough – as the dynamic URL is never in the navigation bar, and the user will only get it if they right click the page. Even then, it merely redirects to the correct page.

I’m assuming that the search engines are working on filtering it out, although it would be nice to see an industry wide effort among developers and the search engines in creating a standardized format for link tracking, session parameters, and simple URL conventions.

That would lose all of us SEO consultants some excuse for charging money – but it would really save some time ;)

Nathan Buggia December 10, 2008 at 4:28 pm

Thank you all for the great comments.

To Nick and Jonathan, I wish I could tell you guys about all the wonderful things we’re working on, but all I can say at this point is we hear your feedback loud and clear.

Jane and Robot will be the first to update this article should a better solution come along.

Van Lease December 11, 2008 at 10:08 pm

The 301 redirect method is my favoured option, although it does have limitations as you’ve pointed out in your thorough post.
I’ve been toying with using the robots.txt approach for a while and your post may be just the prompt I need to give it a try. Many Thanks

Malte Landwehr December 14, 2008 at 4:26 am

Haven’t read the whole article yet but the first advices sounded great. Bookmarked!

WannaDevelop.com December 15, 2008 at 10:00 pm

I’m going to give this a try.. Very interesting :)

Mike

Raffi December 17, 2008 at 4:37 am

I know I’m a little bit late to the party, but I have another suggestion for query parameters.

Spiders don’t report a referrer because they are direct-load. You could check for a referrer and when you don’t see one do a 301 redirect to the "clean" URL.

This way if someone else blogs about your URL you get the link credit from Google and can still track visitors that click through. Only "issue" I would see is that you’d lose tracking on someone who bookmarks your URL with parameters (on subsequent visits).

You’re still serving the same content to visitors and bots and you’re doing it without regard to user-agents.

I’d love to hear people’s thoughts on this.

ADAC December 28, 2008 at 3:47 pm

Thanks for the insight into how the search engines look at things. Most of these problems can be easily avoided if you know about them before you start programming the site.

zara clothing January 6, 2009 at 7:31 pm

I still don’t get it why some webmasters make multiple pages with the same content ?

Winwab January 17, 2009 at 4:45 am

Thanks for sharing this article.

Sankar May 27, 2009 at 5:23 am

Thanks Nathan.

I came to know lots of new points by reading this article. still I am not up to the mark in the points what you have discussed such as reporting using # parameter, canonical tag. Have to do a little more research about it and needs make myself perfect. Bookmarked this url, have to check later in detail.

Thanks
Sankar

Webmaster August 1, 2009 at 3:49 pm

Thanks for this important SEO information! Every SEO should know this!

Caleb Whitmore November 7, 2009 at 8:51 am

Your example for Google Analytics, all respect to my friend Avinash, won’t work with GA. Here’s why:

1) GA can’t deal with # in the URI gracefully. It just ignores # in the utmp= field of a utm.gif tracking hit (what will be produced by running pageTracker._trackPageview();). In order to use the method noted you have to write some additional JavaScript that drops the “#” from the location.hash result, puts the result into a variable, then placed that variable in the customized trackPageview(); call.

2) You completely overlooked the built in facility in Google Analytics for anchor-based campaign tracking parameters. The method is simply pageTracker._setAllowAnchor(true);. That will let you put your “utm_campaign” custom campaign parameters into the anchor string rather than URI parameters. In my opinion this should be defaulted to true in GA and is always a best practice to use when tagging your pages.

-Caleb

Polly Pospyelova February 19, 2010 at 1:49 am

Great post! I have used all of these solutions in one way or the other before, except for the canonical URL tag. My preference goes to the redirects and URL fragments. The last one is a very neat solution to avoid duplicate URLs on websites where you simply have no choice but to have duplicate pages.

Polly
