If you’re an analyst reading this article, you have probably run into spam traffic more than once. Don’t know what I’m talking about? Just look at this:
This is past referral traffic data from one of our clients. Among the top 10 referrers are domains that were sending fake traffic to the client’s website. Look at the number of sessions. Nonsense. What are these websites, and why are they sending so much traffic?! They are just sitting there in 2nd and 3rd position. I’m not even going to show you what I found outside the top 10: more spam, with even weirder hostnames. If you were planning to visit these domains, my advice is: don’t. That is exactly what they want you to do. It’s called spam traffic because this is their way of advertising whatever it is they are advertising.
So why is this bad? Because of these fake sessions, reports become dirty and inaccurate. How can you present a report to a client when 14% of all referral traffic is fake (as in the screenshot example)? Fake sessions corrupt all your data and metrics. On higher-traffic sites it’s not that big of an issue. However, it’s a real pain in the a** on low-traffic websites, where a large share of the traffic is not actually coming from real people.
Types of spam
Bots crawl your website to harvest information. Some bots are good and some are bad.
For example, Googlebot is a good one because it crawls websites for indexing purposes. The bad ones collect information (like email addresses), slow down your server, and look for vulnerabilities.
Ghost spam is the most common type of spam encountered in analytics. It never actually visits your website. Instead, it sends data directly to your Google Analytics account without any interaction with the site whatsoever. Websites are targeted at random, so don’t fret that yours is the only one they are after.
Identifying Referral Spam
You can spot most of them really easily. Open your Google Analytics account, and under Acquisition > All Traffic > Referrals you can see which domains are sending traffic. You should definitely choose a longer date range, just to be sure you have enough data to recognize these bastards. Look for anomalies in the metrics that don’t make any sense, such as referrers with:
- avg. session duration of 0 seconds
- pages per session <1
- 100% bounce rate
- 100% new sessions
Below is the expanded screenshot from the first example. You can almost always see these anomalies in the metrics of referral spam, although some sources are easier to identify than others. Can you see why this is bad? The reports become skewed: this trash traffic inflates bounce rates and makes engagement metrics unrealistic.
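The checklist above can also be applied to an exported referrals report. Below is a minimal sketch, assuming you have copied the report rows into Python by hand. The `ReferralRow` class, the `looks_like_spam` helper, the three-out-of-four threshold, and the sample rows are all hypothetical and chosen for illustration only; they are not real client data.

```python
from dataclasses import dataclass


@dataclass
class ReferralRow:
    source: str
    sessions: int
    avg_session_duration: float  # seconds
    pages_per_session: float
    bounce_rate: float           # percent, 0-100
    new_sessions_pct: float      # percent, 0-100


def looks_like_spam(row, min_signals=3):
    """Flag a row when at least `min_signals` of the four anomalies are present."""
    signals = [
        row.avg_session_duration == 0,
        row.pages_per_session < 1,
        row.bounce_rate >= 100,
        row.new_sessions_pct >= 100,
    ]
    return sum(signals) >= min_signals


# Hypothetical sample rows, not taken from any real report.
rows = [
    ReferralRow("google.com", 1200, 95.0, 2.4, 48.0, 61.0),
    ReferralRow("free-traffic.example", 310, 0.0, 1.0, 100.0, 100.0),
    ReferralRow("partner-blog.example", 45, 0.0, 1.0, 100.0, 40.0),
]

suspects = [row.source for row in rows if looks_like_spam(row)]
print(suspects)
```

Requiring several signals at once is a deliberate choice: a small but legitimate referrer can easily show a 100% bounce rate on its own, so a single anomaly is not treated as proof of spam.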
How to get rid of ghost referral spam?
Before we get to the main thing, there is one option that should always be checked in every Google Analytics account, whether you’re affected by ghost spam or not. Go to View Settings and tick the “Exclude all hits from known bots and spiders” checkbox. This option automatically filters out hits from known bots and spiders. You already had that checked? Nice.
Google Analytics has a neat Filters option that gives us the power to exclude most future spam traffic from analytics data. The point is, we are not going to exclude fake referrals one by one, like most people do. That is an endless process: every time a new fake referral appears, you have to change your filters. You can keep and maintain a list of spam sources, but it’s just not effective in the long run. The most effective way to keep them from showing up in Google Analytics is to create a filter that includes only valid hostnames, that is, the real traffic on our website.
To identify your valid hostnames, go to Audience > Technology > Network. Make sure you set a long date range so you don’t miss any data. Click on the Hostname dimension to see a list of all hostnames, including spam. From this report, pick out only your valid hostnames. Do this carefully, because if you forget to include one, your data could be incomplete and show less traffic than you are actually getting. Depending on the amount of traffic it generates, you might want to include translate.googleusercontent.com, as users can view your website content through the Google Translate service.
For the Neuralab website, we could use a simple regular expression when including hostnames in the filter.
The pipe “|” is a regex symbol that means “OR”.
With this expression, we include only traffic whose hostname is www.neuralab.net or blog.neuralab.net. If you find more valid hostnames, you can add them to the same expression so you don’t have to add them one by one. Once all hostnames are found, it’s time to include them in Filters.
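To make the idea concrete, here is a small sketch of how such an expression could be built and checked before it goes anywhere near a filter. The exact pattern used for Neuralab is not reproduced in this article, so the one generated here is an assumption: the two hostnames joined with a pipe, with the dots escaped. The helper name `build_hostname_pattern` is hypothetical, and `re.search` is used on the assumption that the filter matches the pattern anywhere in the hostname.

```python
import re

# Hostnames named in the article; extend the list with any others you find.
VALID_HOSTNAMES = ["www.neuralab.net", "blog.neuralab.net"]


def build_hostname_pattern(hostnames):
    """Join hostnames with '|' (OR), escaping dots so they match literally."""
    return "|".join(re.escape(hostname) for hostname in hostnames)


pattern = build_hostname_pattern(VALID_HOSTNAMES)
print(pattern)

# Quick sanity check against a few hostnames (the last two are made up).
for candidate in ["www.neuralab.net", "blog.neuralab.net",
                  "wwwXneuralab.net", "cheap-seo.example"]:
    verdict = "keep" if re.search(pattern, candidate) else "drop"
    print(f"{candidate}: {verdict}")
```

Escaping the dots matters because an unescaped `.` matches any character, which would let look-alike hostnames slip through. If you want stricter matching, wrapping the whole expression as `^(...)$` is one option, again assuming the filter supports standard anchors.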
1. Go to the Admin section of Google Analytics and click on Filters
2. Add a new filter and name it accordingly
3. Pick the Custom filter type and choose Include (very important!)
4. Choose Hostname under Filter Field
5. Under Filter Pattern, enter the regular expression that contains your valid hostnames
6. Click Save and you’re finished
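Since an include filter only affects data collected from that point on and a forgotten hostname means lost traffic, it can be worth doing a dry run first. The sketch below assumes you have typed the sessions per hostname from the Network report into a dictionary. The function `preview_include_filter`, the pattern, and all the numbers are hypothetical and for illustration only.

```python
import re


def preview_include_filter(pattern, sessions_by_hostname):
    """Split hostnames into those an include filter would keep and those it would drop."""
    regex = re.compile(pattern)
    kept, dropped = {}, {}
    for hostname, sessions in sessions_by_hostname.items():
        target = kept if regex.search(hostname) else dropped
        target[hostname] = sessions
    return kept, dropped


# Assumed pattern and made-up session counts.
pattern = r"www\.neuralab\.net|blog\.neuralab\.net"
report = {
    "www.neuralab.net": 5400,
    "blog.neuralab.net": 820,
    "translate.googleusercontent.com": 12,
    "(not set)": 230,
    "cheap-seo.example": 95,
}

kept, dropped = preview_include_filter(pattern, report)
print("kept:", sorted(kept))
print("dropped:", sorted(dropped))
print("sessions dropped:", sum(dropped.values()))
```

Reading through the dropped list is the point of the exercise: if a hostname you care about shows up there, such as translate.googleusercontent.com, add it to the expression before saving the filter.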
That’s it. From now on, you shouldn’t have any more problems with ghost spam. Kick back and enjoy analyzing your much cleaner data.