Tag Archive | "Data"

Trust Your Data: How to Efficiently Filter Spam, Bots, & Other Junk Traffic in Google Analytics

Posted by Carlosesal

There is no doubt that Google Analytics is one of the most important tools you could use to understand your users’ behavior and measure the performance of your site. There’s a reason it’s used by millions across the world.

But despite being such an essential part of the decision-making process for many businesses and blogs, I often find sites (of all sizes) that do little or no data filtering after installing the tracking code, which is a huge mistake.

Think of a Google Analytics property without filtered data as one of those styrofoam cakes with edible parts. It may seem genuine from the top, and it may even feel right when you cut a slice, but as you go deeper and deeper you find that much of it is artificial.

If you’re one of those that haven’t properly configured their Google Analytics and you only pay attention to the summary reports, you probably won’t notice that there’s all sorts of bogus information mixed in with your real user data.

And as a consequence, you won’t realize that your efforts are being wasted on analyzing data that doesn’t represent the actual performance of your site.

To make sure you’re getting only the real ingredients and prevent you from eating that slice of styrofoam, I’ll show you how to use the tools that GA provides to eliminate all the artificial excess that inflates your reports and corrupts your data.

Common Google Analytics threats

As most of the people I’ve worked with know, I’ve always been obsessed with the accuracy of data, mainly because as a marketer/analyst there’s nothing worse than realizing that you’ve made a wrong decision because your data wasn’t accurate. That’s why I’m continually exploring new ways of improving it.

As a result of that research, I wrote my first Moz post about the importance of filtering in Analytics, specifically about ghost spam, which was a significant problem at that time and still is (although to a lesser extent).

While the methods described there are still quite useful, I’ve since been researching solutions for other types of Google Analytics spam and a few other threats that might not be as annoying, but that are equally or even more harmful to your Analytics.

Let’s review, one by one.

Ghosts, crawlers, and other types of spam

The GA team has done a pretty good job handling ghost spam. The amount of it has been dramatically reduced over the last year, compared to the outbreak in 2015/2017.

However, the millions of current users and the thousands of new, unaware users that join every day, plus the majority’s curiosity to discover why someone is linking to their site, make Google Analytics too attractive a target for the spammers to just leave it alone.

The same logic can be applied to any widely used tool: no matter what security measures it has, there will always be people trying to abuse its reach for their own interest. Thus, it’s wise to add an extra security layer.

Take, for example, the most popular CMS: WordPress. Despite having some built-in security measures, if you don’t take additional steps to protect it (like setting a strong username and password or installing a security plugin), you run the risk of being hacked.

The same happens to Google Analytics, but instead of plugins, you use filters to protect it.

In which reports can you look for spam?

Spam traffic will usually show as a Referral, but it can appear in any part of your reports, even in unsuspecting places like a language or page title.

Sometimes spammers will try to fool you by using misleading URLs that are very similar to those of known websites, or they may try to get your attention by using unusual characters and emojis in the source name.

Regardless of the type of spam, there are 3 things you should always do when you think you’ve found some in your reports:

  1. Never visit the suspicious URL. Most of the time they’ll try to sell you something or promote their service, but some spammers might have some malicious scripts on their site.
  2. This goes without saying, but never install scripts from unknown sites; if for some reason you did, remove them immediately and scan your site for malware.
  3. Filter out the spam in your Google Analytics to keep your data clean (more on that below).

If you’re not sure whether an entry on your report is real, try searching for the URL in quotes (“example.com”). Your browser won’t open the site, but instead will show you the search results; if it is spam, you’ll usually see posts or forums complaining about it.

If you still can’t find information about that particular entry, give me a shout — I might have some knowledge for you.

Bot traffic

A bot is a piece of software that runs automated scripts over the Internet for different purposes.

There are all kinds of bots. Some have good intentions, like the bots used to check copyrighted content or the ones that index your site for search engines, and others not so much, like the ones scraping your content to clone it.

2016 bot traffic report. Source: Incapsula

In either case, this type of traffic is not useful for your reporting and might be even more damaging than spam both because of the amount and because it’s harder to identify (and therefore to filter it out).

It’s worth mentioning that bots can be blocked from your server to stop them from accessing your site completely, but this usually involves editing sensitive files that require high technical knowledge, and as I said before, there are good bots too.

So, unless you’re receiving a direct attack that’s straining your server resources, I recommend you just filter them in Google Analytics.

In which reports can you look for bot traffic?

Bots will usually show as Direct traffic in Google Analytics, so you’ll need to look for patterns in other dimensions to be able to filter it out. For example, large companies that use bots to navigate the Internet will usually have a unique service provider.

I’ll go into more detail on this below.

Internal traffic

Most users get worried and anxious about spam, which is normal — nobody likes weird URLs showing up in their reports. However, spam isn’t the biggest threat to your Google Analytics.

You are!

The traffic generated by people (and bots) working on the site is often overlooked despite the huge negative impact it has. The main reason it’s so damaging is that in contrast to spam, internal traffic is difficult to identify once it hits your Analytics, and it can easily get mixed in with your real user data.

There are different types of internal traffic and different ways of dealing with it.

Direct internal traffic

Testers, developers, marketing team, support, outsourcing… the list goes on. Any member of the team that visits the company website or blog for any purpose could be contributing internal traffic.

In which reports can you look for direct internal traffic?

Unless your company uses a private ISP domain, this traffic is tough to identify once it hits you, and will usually show as Direct in Google Analytics.

Third-party sites/tools

This type of internal traffic includes traffic generated directly by you or your team when using tools to work on the site, such as management tools like Trello or Asana.

It also includes traffic coming from bots doing automated work for you; for example, services used to monitor the performance of your site, like Pingdom or GTmetrix.

Some types of tools you should consider:

  • Project management
  • Social media management
  • Performance/uptime monitoring services
  • SEO tools

In which reports can you look for internal third-party tools traffic?

This traffic will usually show as Referral in Google Analytics.

Development/staging environments

Some websites use a test environment to make changes before applying them to the main site. Normally, these staging environments have the same tracking code as the production site, so if you don’t filter it out, all the testing will be recorded in Google Analytics.

In which reports can you look for development/staging environments?

This traffic will usually show as Direct in Google Analytics, but you can find it under its own hostname (more on this later).

Web archive sites and cache services

Archive sites like the Wayback Machine offer historical views of websites. The reason you can see those visits on your Analytics — even if they are not hosted on your site — is that the tracking code was installed on your site when the Wayback Machine bot copied your content to its archive.

One thing is for certain: when someone goes to check how your site looked in 2015, they don’t have any intention of buying anything from your site — they’re simply doing it out of curiosity, so this traffic is not useful.

In which reports can you look for traffic from web archive sites and cache services?

You can also identify this traffic on the hostname report.

A basic understanding of filters

The solutions described below use Google Analytics filters, so to avoid problems and confusion, you’ll need a basic understanding of how they work, and you should check a few prerequisites first.

Things to consider before using filters:

1. Create an unfiltered view.

Before you do anything, it’s highly recommended that you create an unfiltered view; it will help you track the efficacy of your filters. Plus, it works as a backup in case something goes wrong.

2. Make sure you have the correct permissions.

You will need edit permissions at the account level to create filters; edit permissions at view or property level won’t work.

3. Filters don’t work retroactively.

In GA, aggregated historical data can’t be deleted, at least not permanently. That’s why the sooner you apply the filters to your data, the better.

4. The changes made by filters are permanent!

If your filter is not correctly configured because you didn’t enter the correct expression (missing relevant entries, a typo, an extra space, etc.), you run the risk of losing valuable data FOREVER; there is no way of recovering filtered data.

But don’t worry — if you follow the recommendations below, you shouldn’t have a problem.

5. Wait for it.

Most of the time you can see the effect of the filter within minutes or even seconds after applying it; however, officially it can take up to twenty-four hours, so be patient.

Types of filters

There are two main types of filters: predefined and custom.

Predefined filters are very limited, so I rarely use them. I prefer to use the custom ones because they allow regular expressions, which makes them a lot more flexible.

Within the custom filters, there are five types: exclude, include, lowercase/uppercase, search and replace, and advanced.

Here we will use the first two: exclude and include. We’ll save the rest for another occasion.

Essentials of regular expressions

If you already know how to work with regular expressions, you can jump to the next section.

REGEX (short for regular expressions) are text strings prepared to match patterns with the use of some special characters. These characters help match multiple entries in a single filter.

Don’t worry if you don’t know anything about them. We will use only the basics, and for some filters, you will just have to COPY-PASTE the expressions I pre-built.

REGEX special characters

There are many special characters in REGEX, but for basic GA expressions we can focus on three:

  • ^ The caret: used to indicate the beginning of a pattern,
  • $ The dollar sign: used to indicate the end of a pattern,
  • | The pipe or bar: means “OR,” and it is used to indicate that you are starting a new pattern.

When using the pipe character, you should never ever:

  • Put it at the beginning of the expression,
  • Put it at the end of the expression,
  • Put 2 or more together.

Any of those will mess up your filter and probably your Analytics.

A simple example of REGEX usage

Let’s say I go to a restaurant that has an automatic machine that makes fruit salad, and to choose the fruit, you have to use regular expressions.

This super machine has the following fruits to choose from: strawberry, orange, blueberry, apple, pineapple, and watermelon.

To make a salad with my favorite fruits (strawberry, blueberry, apple, and watermelon), I have to create a REGEX that matches all of them. Easy! Since the pipe character “|” means OR I could do this:

  • REGEX 1: strawberry|blueberry|apple|watermelon

The problem with that expression is that REGEX also considers partial matches, and since pineapple also contains “apple,” it would be selected as well… and I don’t like pineapple!

To avoid that, I can use the other two special characters I mentioned before to make an exact match for apple: the caret “^” (begins here) and the dollar sign “$” (ends here). It will look like this:

  • REGEX 2: strawberry|blueberry|^apple$|watermelon

The expression will select precisely the fruits I want.

But let’s say for demonstration’s sake that the fewer characters you use, the cheaper the salad will be. To optimize the expression, I can use the ability for partial matches in REGEX.

Since strawberry and blueberry both contain “berry,” and no other fruit in the list does, I can rewrite my expression like this:

  • Optimized REGEX: berry|^apple$|watermelon

That’s it — now I can get my fruit salad with the right ingredients, and at a lower price.
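If you want to see the difference between partial and exact matching outside of GA, here’s a minimal, purely illustrative sketch that applies the three expressions above to the machine’s fruit list:

```typescript
// The fruits available in the machine.
const fruits = ["strawberry", "orange", "blueberry", "apple", "pineapple", "watermelon"];

// The three expressions from the example above.
const regex1 = /strawberry|blueberry|apple|watermelon/;    // partial match: also catches "pineapple"
const regex2 = /strawberry|blueberry|^apple$|watermelon/;  // exact match for "apple" only
const optimized = /berry|^apple$|watermelon/;              // shortest version

for (const pattern of [regex1, regex2, optimized]) {
  const selected = fruits.filter((fruit) => pattern.test(fruit));
  console.log(pattern.source, "->", selected.join(", "));
}
// regex1 selects pineapple too; regex2 and the optimized version select exactly
// strawberry, blueberry, apple, and watermelon.
```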

3 ways of testing your filter expression

As I mentioned before, filter changes are permanent, so you have to make sure your filters and REGEX are correct. There are 3 ways of testing them:

  • Right from the filter window; just click on “Verify this filter,” quick and easy. However, it’s not the most accurate since it only takes a small sample of data.

  • Using an online REGEX tester; very accurate and colorful, you can also learn a lot from these, since they show you exactly the matching parts and give you a brief explanation of why.

  • Using an in-table temporary filter in GA; you can test your filter against all your historical data. This is the most precise way of making sure you don’t miss anything.

If you’re doing a simple filter or you have plenty of experience, you can use the built-in filter verification. However, if you want to be 100% sure that your REGEX is ok, I recommend you build the expression on the online tester and then recheck it using an in-table filter.

Quick REGEX challenge

Here’s a small exercise to get you started. Go to this premade example with the optimized expression from the fruit salad case and test the first 2 REGEX I made. You’ll see live how the expressions impact the list.

Now make your own expression to pay as little as possible for the salad.

Remember:

  • We only want strawberry, blueberry, apple, and watermelon;
  • The fewer characters you use, the less you pay;
  • You can do small partial matches, as long as they don’t include the forbidden fruits.

Tip: You can do it with as few as 6 characters.

Now that you know the basics of REGEX, we can continue with the filters below. But I encourage you to put “learn more about REGEX” on your to-do list — they can be incredibly useful not only for GA, but for many tools that allow them.

How to create filters to stop spam, bots, and internal traffic in Google Analytics

Back to our main event: the filters!

Where to start: To avoid being repetitive when describing the filters below, here are the standard steps you need to follow to create them:

  1. Go to the admin section in your Google Analytics (the gear icon at the bottom left corner),
  2. Under the View column (master view), click the “Filters” button (don’t click on “All Filters” in the Account column).
  3. Click the red “+Add Filter” button. (If you don’t see it, or you can only apply/remove already created filters, then you don’t have edit permissions at the account level; ask your admin to create them or give you the permissions.)
  4. Then follow the specific configuration for each of the filters below.

The filter window is your best partner for improving the quality of your Analytics data, so it will be a good idea to get familiar with it.

Valid hostname filter (ghost spam, dev environments)

Prevents traffic from:

  • Ghost spam
  • Development hostnames
  • Scraping sites
  • Cache and archive sites

This filter may be the single most effective solution against spam. In contrast with other commonly shared solutions, the hostname filter is preventative, and it rarely needs to be updated.

Ghost spam earns its name because it never really visits your site. It’s sent directly to the Google Analytics servers using a feature called Measurement Protocol, a tool that under normal circumstances allows tracking from devices you wouldn’t imagine could be tracked, like coffee machines or refrigerators.

Real users pass through your server, then the data is sent to GA; hence it leaves valid information. Ghost spam is sent directly to GA servers, without knowing your site URL; therefore all data left is fake. Source: carloseo.com

The spammer abuses this feature to simulate visits to your site, most likely using automated scripts to send traffic to randomly generated tracking codes (UA-0000000-1).

Since these hits are random, the spammers don’t know who they’re hitting; for that reason ghost spam will always leave a fake or (not set) host. Using that logic, if you create a filter that only includes valid hostnames, all ghost spam will be left out.
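To make the mechanism concrete, below is a hedged sketch of what a Measurement Protocol pageview hit looks like; the property ID, client ID, and hostname are placeholders. The point to notice is that the hostname (dh) is whatever the sender declares: a hit generated by your real tracking code reports your real hostname, while a ghost-spam script that never loads your site has to make one up or leave it unset, which is exactly what the include-valid-hostnames filter catches.

```typescript
// Hedged sketch: a Universal Analytics Measurement Protocol pageview hit.
// "UA-000000-1", the client ID, and "yourdomain.com" are placeholders.
const hit = new URLSearchParams({
  v: "1",                                      // protocol version
  tid: "UA-000000-1",                          // property the hit is credited to
  cid: "35009a79-1a05-49d7-b876-2b884d0f825b", // anonymous client ID (any UUID)
  t: "pageview",                               // hit type
  dh: "yourdomain.com",                        // document hostname: self-reported, never verified
  dp: "/example-page",                         // document path
});

// The hit goes straight to Google's collection endpoint; the measured site's
// own server is never touched, which is exactly how ghost spam operates.
fetch("https://www.google-analytics.com/collect", { method: "POST", body: hit })
  .then((res) => console.log("Hit sent, status:", res.status));
```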

Where to find your hostnames

Now here comes the “tricky” part. To create this filter, you will need to make a list of your valid hostnames.

A list of what!?

Essentially, a hostname is any place where your GA tracking code is present. You can get this information from the hostname report:

  • Go to Audience > Technology > Network, then at the top of the table change the primary dimension to Hostname.

If your Analytics is active, you should see at least one: your domain name. If you see more, scan through them and make a list of all the ones that are valid for you.

Types of hostname you can find

The good ones:

  • Your domain and subdomains: yourdomain.com
  • Tools connected to your Analytics: YouTube, MailChimp
  • Payment gateways: Shopify, booking systems
  • Translation services: Google Translate
  • Mobile speed-up services: Google weblight

The bad ones (by bad, I mean not useful for your reports):

  • Staging/development environments: staging.yourdomain.com
  • Internet archive sites: web.archive.org
  • Scraping sites that don’t bother to trim the content: the URL of the scraper
  • Spam: most of the time they will show their URL, but sometimes they may use the name of a known website to try to fool you. If you see a URL that you don’t recognize, just think, “Do I manage it?” If the answer is no, then it isn’t your hostname.
  • (not set) hostname: it usually comes from spam. On rare occasions it’s related to tracking code issues.

Below is an example of my hostname report (from the unfiltered view, of course; the master view is squeaky clean).

Now with the list of your good hostnames, make a regular expression. If you only have your domain, then that is your expression; if you have more, create an expression with all of them as we did in the fruit salad example:

Hostname REGEX (example)


yourdomain.com|hostname2|hostname3|hostname4

Important! You cannot create more than one “Include hostname filter”; if you do, you will exclude all data. So try to fit all your hostnames into one expression (you have 255 characters).

The “valid hostname filter” configuration:

  • Filter Name: Include valid hostnames
  • Filter Type: Custom > Include
  • Filter Field: Hostname
  • Filter Pattern: [hostname REGEX you created]
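If you maintain more than a handful of hostnames, it can help to assemble and sanity-check the expression in code before pasting it into the filter. Here’s a minimal sketch; the hostname list is a placeholder, escaping the dots is slightly stricter than the plain example above, and the length check reflects the 255-character pattern limit mentioned earlier:

```typescript
// Placeholder list: replace with the valid hostnames from your own report.
const validHostnames = ["yourdomain.com", "translate.googleusercontent.com", "checkout.shopify.com"];

// Escape regex metacharacters (mainly the dots) and join with "|" (OR).
const escapeForRegex = (host: string) => host.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const pattern = validHostnames.map(escapeForRegex).join("|");

if (pattern.length > 255) {
  // You can only have ONE include-hostname filter, so the whole expression must fit.
  throw new Error(`Pattern is ${pattern.length} characters; trim it below 256.`);
}

console.log(pattern); // paste this into the "Filter Pattern" field
```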

Campaign source filter (Crawler spam, internal sources)

Prevents traffic from:

  • Crawler spam
  • Internal third-party tools (Trello, Asana, Pingdom)

Important note: Even if these hits are shown as a referral, the field you should use in the filter is “Campaign source” — the field “Referral” won’t work.

Filter for crawler spam

The second most common type of spam is crawler spam. Crawlers also pretend to be a valid visit by leaving a fake source URL, but in contrast with ghost spam, they do access your site and therefore leave a correct hostname.

You will need to create an expression the same way as the hostname filter, but this time, you will put together the source/URLs of the spammy traffic. The difference is that you can create multiple exclude filters.

Crawler REGEX (example)


spam1|spam2|spam3|spam4

Crawler REGEX (pre-built)


As promised, here are the latest pre-built crawler expressions; you just need to copy and paste them.

The “crawler spam filter” configuration:

  • Filter Name: Exclude crawler spam 1
  • Filter Type: Custom > Exclude
  • Filter Field: Campaign source
  • Filter Pattern: [crawler REGEX]
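Since you can create as many exclude filters as you need, one practical approach is to split a long list of spam sources into chunks that each fit within the 255-character pattern limit. A rough sketch, with placeholder domains:

```typescript
// Placeholder spam sources; swap in the pre-built crawler list you maintain.
const spamSources = ["spamdomain1.xyz", "spamdomain2.top", "spamdomain3.icu", "spamdomain4.gq"];

// Pack sources into as few "|"-joined expressions as possible, each <= 255 chars.
function chunkPatterns(sources: string[], maxLen = 255): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const source of sources) {
    const candidate = current ? `${current}|${source}` : source;
    if (current && candidate.length > maxLen) {
      chunks.push(current); // this chunk is full; start a new one
      current = source;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Each returned string becomes one "Exclude crawler spam N" filter pattern.
chunkPatterns(spamSources).forEach((p, i) => console.log(`Filter ${i + 1}:`, p));
```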

Filter for internal third-party tools

Although you can combine your crawler spam filter with internal third-party tools, I like to have them separated, to keep them organized and more accessible for updates.

The “internal tools filter” configuration:

  • Filter Name: Exclude internal tool sources
  • Filter Type: Custom > Exclude
  • Filter Field: Campaign source
  • Filter Pattern: [tool source REGEX]

Internal Tools REGEX (example)


trello|asana|redmine

If one of the tools you use internally also sends you traffic from real visitors, don’t filter it. Instead, use the “Exclude Internal URL Query” filter described below.

For example, I use Trello, but since I share analytics guides on my site, some people link them from their Trello accounts.

Filters for language spam and other types of spam

The previous two filters will stop most of the spam; however, some spammers use different methods to bypass the previous solutions.

For example, they try to confuse you by showing one of your valid hostnames combined with a well-known source like Apple, Google, or Moz. Even my site has been a target (not saying that everyone knows my site; it just looks like the spammers don’t agree with my guides).

However, even if the source and host look fine, the spammer injects their message in another part of your reports like the keyword, page title, and even as a language.

In those cases, you will have to identify the dimension/report where the spam appears and use the corresponding field in the filter. It’s important to consider that the name of the report doesn’t always match the name of the filter field:

  • Language report → Language settings
  • Referral report → Campaign source
  • Organic Keyword report → Search term
  • Service Provider report → ISP Organization
  • Network Domain report → ISP Domain

Here are a couple of examples.

The “language spam/bot filter” configuration:

  • Filter Name: Exclude language spam
  • Filter Type: Custom > Exclude
  • Filter Field: Language settings
  • Filter Pattern: [Language REGEX]

Language Spam REGEX (Prebuilt)


\s[^\s]*\s|.{15,}|\.|,|^c$

The expression above excludes fake languages that don’t meet the required format. For example, take these weird messages appearing instead of regular languages like en-us or es-es:

Examples of language spam
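If you’d like to confirm what that expression will and won’t exclude before applying it, a quick test along these lines can help (the spam-style string is a made-up stand-in for the junk that shows up in the language report):

```typescript
// Pre-built language spam expression from above.
const languageSpam = /\s[^\s]*\s|.{15,}|\.|,|^c$/;

const samples = [
  "en-us",                          // legitimate language code -> kept
  "es-es",                          // legitimate language code -> kept
  "c",                              // bogus single-"c" language -> excluded
  "secret.example.com invites you", // made-up spam-style message -> excluded
];

for (const value of samples) {
  console.log(`${value} -> ${languageSpam.test(value) ? "excluded by filter" : "kept"}`);
}
```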

The organic/keyword spam filter configuration:

  • Filter Name: Exclude organic spam
  • Filter Type: Custom > Exclude
  • Filter Field: Search term
  • Filter Pattern: [keyword REGEX]

Filters for direct bot traffic

Bot traffic is a little trickier to filter because it doesn’t leave a source like spam, but it can still be filtered with a bit of patience.

The first thing you should do is enable bot filtering. In my opinion, it should be enabled by default.

Go to the Admin section of your Analytics and click on View Settings. You will find the option “Exclude all hits from known bots and spiders” below the currency selector:

It would be wonderful if this would take care of every bot — a dream come true. However, there’s a catch: the key here is the word “known.” This option only takes care of known bots included in the “IAB known bots and spiders list.” That’s a good start, but far from enough.

There are a lot of “unknown” bots out there that are not included in that list, so you’ll have to play detective and search for patterns of direct bot traffic through different reports until you find something that can be safely filtered without risking your real user data.

To start your bot trail search, click on the Segment box at the top of any report, and select the “Direct traffic” segment.

Then navigate through different reports to see if you find anything suspicious.

Some reports to start with:

  • Service provider
  • Browser version
  • Network domain
  • Screen resolution
  • Flash version
  • Country/City

Signs of bot traffic

Although bots are hard to detect, there are some signals you can follow:

  • An unnatural increase of direct traffic
  • Old versions (browsers, OS, Flash)
  • They visit the home page only (usually represented by a slash “/” in GA)
  • Extreme metrics:
    • Bounce rate close to 100%,
    • Session time close to 0 seconds,
    • 1 page per session,
    • 100% new users.

Important! If you find traffic that checks off many of these signals, it is likely bot traffic. However, not all entries with these characteristics are bots, and not all bots match these patterns, so be cautious.

Perhaps the most useful report that has helped me identify bot traffic is the “Service Provider” report. Large corporations frequently use their own Internet service provider name.

I also have a pre-built expression for ISP bots, similar to the crawler expressions.

The bot ISP filter configuration:

  • Filter Name: Exclude bots by ISP
  • Filter Type: Custom > Exclude
  • Filter Field: ISP organization
  • Filter Pattern: [ISP provider REGEX]

ISP provider bots REGEX (prebuilt)


hubspot|^google\sllc$|^google\sinc\.$|alibaba\.com\sllc|ovh\shosting\sinc\.

Latest ISP bot expression

IP filter for internal traffic

We already covered different types of internal traffic, the one from test sites (with the hostname filter), and the one from third-party tools (with the campaign source filter).

Now it’s time to look at the most common and damaging of all: the traffic generated directly by you or any member of your team while working on any task for the site.

To deal with this, the standard solution is to create a filter that excludes the public IP (not private) of all locations used to work on the site.

Examples of places/people that should be filtered

  • Office
  • Support
  • Home
  • Developers
  • Hotel
  • Coffee shop
  • Bar
  • Mall
  • Any place that is regularly used to work on your site

To find the public IP of the location you are working at, simply search for “my IP” in Google. You will see one of these versions:

  • Short IPv4: 1.23.45.67
  • Long IPv6: 2001:0db8:85a3:0000:0000:8a2e:0370:7334

No matter which version you see, make a list with the IP of each place and put them together with a REGEX, the same way we did with other filters.

  • IP address expression: IP1|IP2|IP3|IP4 and so on.
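Here’s a small sketch of putting that expression together; the IPs are documentation-range placeholders. Escaping the dots makes the match stricter than the plain pattern above, and if you use IP anonymization you would list the truncated form ending in .0 instead:

```typescript
// Placeholder office/home IPs; replace with the public IPs you collected.
const internalIPs = ["203.0.113.17", "198.51.100.0"]; // the second is an anonymized example

// Escape the dots so "203.0.113.17" can only match that literal IP.
const ipPattern = internalIPs.map((ip) => ip.replace(/\./g, "\\.")).join("|");

console.log(ipPattern); // e.g. 203\.0\.113\.17|198\.51\.100\.0 -> paste into the IP filter
```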

The static IP filter configuration:

  • Filter Name: Exclude internal traffic (IP)
  • Filter Type: Custom > Exclude
  • Filter Field: IP Address
  • Filter Pattern: [The IP expression]

Cases when this filter won’t be optimal:

There are some cases in which the IP filter won’t be as efficient as it used to be:

  • You use IP anonymization (required by the GDPR). When you anonymize the IP in GA, the last part of the IP is changed to 0. This means that if you have 1.23.45.67, GA will record it as 1.23.45.0, so you need to enter it like that in your filter. The problem is that you might also be excluding other IPs that are not yours.
  • Your Internet provider changes your IP frequently (dynamic IP). This has become a common issue lately, especially if you have the long version (IPv6).
  • Your team works from multiple locations. The way of working is changing — now, not all companies operate from a central office. It’s often the case that some will work from home, others from the train, in a coffee shop, etc. You can still filter those places; however, maintaining the list of IPs to exclude can be a nightmare.
  • You or your team travel frequently. Similar to the previous scenario, if you or your team travels constantly, there’s no way you can keep up with the IP filters.

If one or more of these scenarios applies to you, then this filter is not optimal for you; I recommend you try the “Advanced internal URL query filter” below.

URL query filter for internal traffic

If there are dozens or hundreds of employees in the company, it’s extremely difficult to exclude them when they’re traveling, accessing the site from their personal locations, or mobile networks.

Here’s where the URL query comes to the rescue. To use this filter, you just need to add a query parameter such as “?internal” to any link your team uses to access the site:

  • Internal newsletters
  • Management tools (Trello, Redmine)
  • Emails to colleagues
  • Also works by directly adding it in the browser address bar

Basic internal URL query filter

The basic version of this solution is to create a filter to exclude any URL that contains the query “?internal”.

  • Filter Name: Exclude Internal Traffic (URL Query)
  • Filter Type: Custom > Exclude
  • Filter Field: Request URI
  • Filter Pattern: \?internal

This solution is perfect for instances where the user will most likely stay on the landing page; for example, when sending a newsletter to all employees to check out a new post.

If the user will likely visit more than the landing page, then the subsequent pages will be recorded.

Advanced internal URL query filter

This solution is the champion of all internal traffic filters!

It’s a more comprehensive version of the previous solution and works by filtering internal traffic dynamically using Google Tag Manager, a GA custom dimension, and cookies.

Although this solution is a bit more complicated to set up, once it’s in place:

  • It doesn’t need maintenance
  • Any team member can use it, no need to explain techy stuff
  • Can be used from any location
  • Can be used from any device, and any browser

To activate the filter, you just have to add the text “?internal” to any URL of the website.

That will insert a small cookie in the browser that will tell GA not to record the visits from that browser.

And the best of it is that the cookie will stay there for a year (unless it is manually removed), so the user doesn’t have to add “?internal” every time.
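A minimal sketch of the cookie-setting piece is shown below, written as TypeScript that compiles to the plain JavaScript you would drop into a GTM Custom HTML tag. The cookie name and the custom-dimension wiring are illustrative assumptions; the full setup also needs a GTM variable that reads the cookie and a GA custom dimension (or trigger exception) keyed off its value.

```typescript
// Hedged sketch, meant to run on every page (e.g., via a GTM Custom HTML tag).
// "internal_traffic" is an illustrative cookie name, not something GA or GTM defines.
const isInternalVisit = /[?&]internal/.test(window.location.search);

if (isInternalVisit) {
  const oneYearInSeconds = 365 * 24 * 60 * 60;
  // Mark this browser as internal for a year.
  document.cookie = `internal_traffic=1; max-age=${oneYearInSeconds}; path=/`;
}

// A GTM variable would read this cookie and pass its value into a GA custom
// dimension (or block the GA tag's trigger), so flagged browsers never get recorded.
const isFlaggedInternal = document.cookie.includes("internal_traffic=1");
console.log("Internal browser?", isFlaggedInternal);
```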

Bonus filter: Include only internal traffic

On some occasions, it’s useful to know the traffic generated internally by employees — maybe because you want to measure the success of an internal campaign or just because you’re a curious person.

In that case, you should create an additional view, call it “Internal Traffic Only,” and use one of the internal filters above. Just one! Because if you have multiple include filters, the hit will need to match all of them to be counted.

If you configured the “Advanced internal URL query” filter, use that one. If not, choose one of the others.

The configuration is exactly the same — you only need to change “Exclude” for “Include.”

Cleaning historical data

The filters will prevent future hits from junk traffic.

But what about past affected data?

I know I told you that deleting aggregated historical data is not possible in GA. However, there’s still a way to temporarily clean up at least some of the nasty traffic that has already polluted your reports.

For this, we’ll use an advanced segment (a subset of your Analytics data). There are built-in segments like “Organic” or “Mobile,” but you can also build one using your own set of rules.

To clean our historical data, we will build a segment using all the expressions from the filters above as conditions (except the ones from the IP filter, because IPs are not stored in GA; hence, they can’t be segmented).

To help you get started, you can import this segment template.

You just need to follow the instructions on that page and replace the placeholders. Here is how it looks:

In the actual template, all text is black; the colors are just to help you visualize the conditions.

After importing it, to select the segment:

  1. Click on the box that says “All users” at the top of any of your reports
  2. From your list of segments, check the one that says “0. All Users – Clean”
  3. Lastly, uncheck the “All Users”

Now you can navigate through your reports, and all the junk traffic included in the segment will be removed.

A few things to consider when using this segment:

  • Segments have to be selected each time. A way of having it selected by default is by adding a bookmark when the segment is selected.
  • You can remove or add conditions if you need to.
  • You can edit the segment at any time to update it or add conditions (open the list of segments, then click “Actions” then “Edit”).

  • The hostname expression and third-party tools expression are different for each site.
  • If your site has a large volume of traffic, segments may sample your data when selected, so if you see the little shield icon at the top of your reports turn yellow (it’s normally green), try choosing a shorter period (e.g., one year, six months, or one month).

Conclusion: Which cake would you eat?

Having real and accurate data is essential for your Google Analytics to report as you would expect.

But if you haven’t filtered it properly, it’s almost certain that it will be filled with all sorts of junk and artificial information.

And the worst part is that if you don’t realize your reports contain bogus data, you will likely make poor decisions when deciding on the next steps for your site or business.

The filters I shared above will help you prevent the three most harmful threats that pollute your Google Analytics and keep you from getting a clear view of your site’s actual performance: spam, bots, and internal traffic.

Once these filters are in place, you can rest assured that your efforts (and money!) won’t be wasted on analyzing deceptive Google Analytics data, and your decisions will be based on solid information.

And the benefits don’t stop there. If you’re using other tools that import data from GA, for example, WordPress plugins like GADWP, Excel add-ins like AnalyticsEdge, or SEO suites like Moz Pro, the benefits will trickle down to all of them as well.

Besides highlighting the importance of filters in GA (which I hope is clear by now), I also hope that preparing these filters gives you the curiosity and the foundation to create others that will allow you to do all sorts of remarkable things with your data.

Remember, filters not only allow you to keep away junk, you can also use them to rearrange your real user information — but more on that on another occasion.


That’s it! I hope these tips help you make more sense of your data and make accurate decisions.

Have any questions, feedback, experiences? Let me know in the comments, or reach me on Twitter @carlosesal.



How Much Data Is Missing from Analytics? And Other Analytics Black Holes

Posted by Tom.Capper

If you’ve ever compared two analytics implementations on the same site, or compared your analytics with what your business is reporting in sales, you’ve probably noticed that things don’t always match up. In this post, I’ll explain why data is missing from your web analytics platforms and how large the impact could be. Some of the issues I cover are actually quite easily addressed, and have a decent impact on traffic — there’s never been an easier way to hit your quarterly targets. ;)

I’m going to focus on GA (Google Analytics), as it’s the most commonly used provider, but most on-page analytics platforms have the same issues. Platforms that rely on server logs do avoid some issues but are fairly rare, so I won’t cover them in any depth.

Side note: Our test setup (multiple trackers & customized GA)

On Distilled.net, we have a standard Google Analytics property running from an HTML tag in GTM (Google Tag Manager). In addition, for the last two years, I’ve been running three extra concurrent Google Analytics implementations, designed to measure discrepancies between different configurations.

(If you’re just interested in my findings, you can skip this section, but if you want to hear more about the methodology, continue reading. Similarly, don’t worry if you don’t understand some of the detail here — the results are easier to follow.)

Two of these extra implementations — one in Google Tag Manager and one on page — run locally hosted, renamed copies of the Google Analytics JavaScript file (e.g. www.distilled.net/static/js/au3.js, instead of www.google-analytics.com/analytics.js) to make them harder to spot for ad blockers. I also used renamed JavaScript functions (“tcap” and “Buffoon,” rather than the standard “ga”) and renamed trackers (“FredTheUnblockable” and “AlbertTheImmutable”) to avoid having duplicate trackers (which can often cause issues).
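For reference, here’s a hedged sketch of what such a modified setup can look like with the classic analytics.js API; the file path, function name, tracker name, and property ID mirror the description above but should be treated as illustrative rather than a copy of Distilled’s actual code.

```typescript
// Renamed analytics.js setup: custom global function name ("tcap"),
// locally hosted copy of the library, and a named tracker.
const w = window as any;

w.GoogleAnalyticsObject = "tcap";           // tell analytics.js which global to use
w.tcap = w.tcap || function (...args: unknown[]) {
  (w.tcap.q = w.tcap.q || []).push(args);   // queue calls until the library loads
};
w.tcap.l = Date.now();

const script = document.createElement("script");
script.async = true;
script.src = "/static/js/au3.js";           // renamed, locally hosted analytics.js
document.head.appendChild(script);

// Create and use a named tracker so it doesn't collide with the default one.
w.tcap("create", "UA-000000-1", "auto", "FredTheUnblockable"); // placeholder property ID
w.tcap("FredTheUnblockable.send", "pageview");
```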

This was originally inspired by 2016-era best practice on how to get your Google Analytics setup past ad blockers. I can’t find the original article now, but you can see a very similar one from 2017 here.

Lastly, we have “DianaTheIndefatigable,” which just has a renamed tracker but otherwise uses the standard code and is implemented on-page. This completes the set of all combinations of modified and unmodified GTM and on-page trackers.

Two of Distilled’s modified on-page trackers, as seen on https://www.distilled.net/

Overall, our setups compare as follows:

  • Default: standard function name, GTM HTML tag, standard Google-hosted JavaScript file
  • FredTheUnblockable: renamed function (“tcap”), GTM HTML tag, locally hosted JavaScript file
  • AlbertTheImmutable: renamed function (“buffoon”), on-page, locally hosted JavaScript file
  • DianaTheIndefatigable: standard function name, on-page, standard Google-hosted JavaScript file

I tested their functionality in various browser/ad-block environments by watching for the pageviews appearing in browser developer tools:

Reason 1: Ad Blockers

Ad blockers, primarily as browser extensions, have been growing in popularity for some time now. Primarily this has been to do with users looking for better performance and UX on ad-laden sites, but in recent years an increased emphasis on privacy has also crept in, hence the possibility of analytics blocking.

Effect of ad blockers

Some ad blockers block web analytics platforms by default, others can be configured to do so. I tested Distilled’s site with Adblock Plus and uBlock Origin, two of the most popular ad-blocking desktop browser addons, but it’s worth noting that ad blockers are increasingly prevalent on smartphones, too.

Here’s how Distilled’s setups fared:

(All numbers shown are from April 2018.)

  • GTM: passes Adblock Plus, fails Adblock Plus with “EasyPrivacy” enabled, fails uBlock Origin
  • On page: passes Adblock Plus, fails Adblock Plus with “EasyPrivacy” enabled, fails uBlock Origin
  • GTM + renamed script & function: passes Adblock Plus, fails Adblock Plus with “EasyPrivacy” enabled, fails uBlock Origin
  • On page + renamed script & function: passes Adblock Plus, fails Adblock Plus with “EasyPrivacy” enabled, fails uBlock Origin

Seems like those tweaked setups didn’t do much!

Lost data due to ad blockers: ~10%

Ad blocker usage can be in the 15–25% range depending on region, but many of these installs will be default setups of AdBlock Plus, which as we’ve seen above, does not block tracking. Estimates of AdBlock Plus’s market share among ad blockers vary from 50–70%, with more recent reports tending more towards the former. So, if we assume that at most 50% of installed ad blockers block analytics, that leaves your exposure at around 10%.

Reason 2: Browser “do not track”

This is another privacy motivated feature, this time of browsers themselves. You can enable it in the settings of most current browsers. It’s not compulsory for sites or platforms to obey the “do not track” request, but Firefox offers a stronger feature under the same set of options, which I decided to test as well.
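For what it’s worth, the signal itself is easy to inspect. Here’s a tiny, illustrative sketch of how a page or tag could read it; whether anything honors it is another matter, as the results below show.

```typescript
// navigator.doNotTrack is "1" when the user has asked not to be tracked.
// (Older browsers exposed it under vendor prefixes; this sketch ignores those.)
const dnt = (navigator as any).doNotTrack;

if (dnt === "1") {
  console.log("Do Not Track is enabled; a site could choose to skip analytics here.");
} else {
  console.log("No Do Not Track preference detected.");
}
```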

Effect of “do not track”

Most browsers now offer the option to send a “Do not track” message. I tested the latest releases of Firefox & Chrome for Windows 10.

  • GTM: passes Chrome “do not track,” passes Firefox “do not track,” fails Firefox “tracking protection”
  • On page: passes Chrome “do not track,” passes Firefox “do not track,” fails Firefox “tracking protection”
  • GTM + renamed script & function: passes Chrome “do not track,” passes Firefox “do not track,” fails Firefox “tracking protection”
  • On page + renamed script & function: passes Chrome “do not track,” passes Firefox “do not track,” fails Firefox “tracking protection”

Again, it doesn’t seem that the tweaked setups are doing much work for us here.

Lost data due to “do not track”: <1%

Only Firefox Quantum’s “Tracking Protection,” introduced in February, had any effect on our trackers. Firefox has a 5% market share, but Tracking Protection is not enabled by default. The launch of this feature had no effect on the trend for Firefox traffic on Distilled.net.

Reason 3: Filters

It’s a bit of an obvious one, but filters you’ve set up in your analytics might intentionally or unintentionally reduce your reported traffic levels.

For example, a filter excluding certain niche screen resolutions that you believe to be mostly bots, or internal traffic, will obviously cause your setup to underreport slightly.

Lost data due to filters: ???

Impact is hard to estimate, as setup will obviously vary on a site-by-site basis. I do recommend having a duplicate, unfiltered “master” view in case you realize too late you’ve lost something you didn’t intend to.

Reason 4: GTM vs. on-page vs. misplaced on-page

Google Tag Manager has become an increasingly popular way of implementing analytics in recent years, due to its increased flexibility and the ease of making changes. However, I’ve long noticed that it can tend to underreport vs. on-page setups.

I was also curious about what would happen if you didn’t follow Google’s guidelines in setting up on-page code.

By combining my numbers with numbers from my colleague Dom Woodman’s site (you’re welcome for the link, Dom), which happens to use a Drupal analytics add-on as well as GTM, I was able to see the difference between Google Tag Manager and misplaced on-page code (right at the bottom of the <body> tag). I then weighted this against my own Google Tag Manager data to get an overall picture of all 5 setups.

Effect of GTM and misplaced on-page code

Traffic as a percentage of baseline (standard Google Tag Manager implementation):

  • Chrome: GTM 100.00%, modified GTM 98.75%, on-page in <head> 100.77%, modified on-page in <head> 99.80%, on-page misplaced in <body> 94.75%
  • Safari: GTM 100.00%, modified GTM 99.42%, on-page in <head> 100.55%, modified on-page in <head> 102.08%, on-page misplaced in <body> 82.69%
  • Firefox: GTM 100.00%, modified GTM 99.71%, on-page in <head> 101.16%, modified on-page in <head> 101.45%, on-page misplaced in <body> 90.68%
  • Internet Explorer: GTM 100.00%, modified GTM 80.06%, on-page in <head> 112.31%, modified on-page in <head> 113.37%, on-page misplaced in <body> 77.18%

There are a few main takeaways here:

  • On-page code generally reports more traffic than GTM
  • Modified code is generally within a margin of error, apart from modified GTM code on Internet Explorer (see note below)
  • Misplaced analytics code will cost you up to a third of your traffic vs. properly implemented on-page code, depending on browser (!)
  • The customized setups, which are designed to get more traffic by evading ad blockers, are doing nothing of the sort.

It’s worth noting also that the customized implementations actually got less traffic than the standard ones. For the on-page code, this is within the margin of error, but for Google Tag Manager, there’s another reason — because I used unfiltered profiles for the comparison, there’s a lot of bot spam in the main profile, which primarily masquerades as Internet Explorer. Our main profile is by far the most spammed, and also acting as the baseline here, so the difference between on-page code and Google Tag Manager is probably somewhat larger than what I’m reporting.

I also split the data by mobile, out of curiosity:

Traffic as a percentage of baseline (standard Google Tag Manager implementation):

  • Desktop: GTM 100.00%, modified GTM 98.31%, on-page in <head> 100.97%, modified on-page in <head> 100.89%, on-page misplaced in <body> 93.47%
  • Mobile: GTM 100.00%, modified GTM 97.00%, on-page in <head> 103.78%, modified on-page in <head> 100.42%, on-page misplaced in <body> 89.87%
  • Tablet: GTM 100.00%, modified GTM 97.68%, on-page in <head> 104.20%, modified on-page in <head> 102.43%, on-page misplaced in <body> 88.13%

The further takeaway here seems to be that mobile browsers, like Internet Explorer, can struggle with Google Tag Manager.

Lost data due to GTM: 1–5%

Google Tag Manager seems to cost you a varying amount depending on what make-up of browsers and devices use your site. On Distilled.net, the difference is around 1.7%; however, we have an unusually desktop-heavy and tech-savvy audience (not much Internet Explorer!). Depending on vertical, this could easily swell to the 5% range.

Lost data due to misplaced on-page code: ~10%

On Teflsearch.com, the impact of misplaced on-page code was around 7.5%, vs Google Tag Manager. Keeping in mind that Google Tag Manager itself underreports, the total loss could easily be in the 10% range.

Bonus round: Missing data from channels

I’ve focused above on areas where you might be missing data altogether. However, there are also lots of ways in which data can be misrepresented, or detail can be missing. I’ll cover these more briefly, but the main issues are dark traffic and attribution.

Dark traffic

Dark traffic is direct traffic that didn’t really come via direct — which is generally becoming more and more common. Typical causes are:

  • Untagged campaigns in email
  • Untagged campaigns in apps (especially Facebook, Twitter, etc.)
  • Misrepresented organic
  • Data sent from botched tracking implementations (which can also appear as self-referrals)

It’s also worth noting the trend towards genuinely direct traffic that would historically have been organic. For example, due to increasingly sophisticated browser autocompletes, cross-device history, and so on, people end up “typing” a URL that they’d have searched for historically.
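The untagged-campaign portion of dark traffic, at least, is fixable on your side: consistently add UTM parameters to the links you control. A small sketch of building such a link, with placeholder campaign values:

```typescript
// Tag a link you control so it doesn't end up as "direct" dark traffic.
const url = new URL("https://www.example.com/blog/new-post");
url.searchParams.set("utm_source", "newsletter");    // where the link lives
url.searchParams.set("utm_medium", "email");         // the channel
url.searchParams.set("utm_campaign", "june-digest"); // placeholder campaign name

console.log(url.toString());
// https://www.example.com/blog/new-post?utm_source=newsletter&utm_medium=email&utm_campaign=june-digest
```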

Attribution

I’ve written about this in more detail here, but in general, a session in Google Analytics (and any other platform) is a fairly arbitrary construct — you might think it’s obvious how a group of hits should be grouped into one or more sessions, but in fact, the process relies on a number of fairly questionable assumptions. In particular, it’s worth noting that Google Analytics generally attributes direct traffic (including dark traffic) to the previous non-direct source, if one exists.

Discussion

I was quite surprised by some of my own findings when researching this post, but I’m sure I didn’t get everything. Can you think of any other ways in which data can end up missing from analytics?


PICA Protocol: A Visualization Prescription for Impactful Data Storytelling – Whiteboard Friday

Posted by Lea-Pica

If you find your presentations are often met with a lukewarm reception, it’s a sure sign it’s time for you to invest in your data storytelling. By following a few smart rules, a structured approach to data visualization could make all the difference in how stakeholders receive and act upon your insights. In this edition of Whiteboard Friday, we’re thrilled to welcome data viz expert Lea Pica to share her strategic methodology for creating highly effective charts.

A Visualization Prescription for Impactful Storytelling

Click on the whiteboard image above to open a high-resolution version in a new tab!

Video Transcription

Hello, Moz fans. Welcome to another edition of Whiteboard Friday. I’m here to talk to you this week about a very hot topic in the digital marketing space. So my name is Lea Pica, and I am a data storytelling trainer, coach, speaker, blogger, and podcaster at LeaPica.com.

I want to tell you a little story. In the 12 years I spent as a digital analyst and SEM, I used to present insights a lot, but nothing ever happened as a result. People fell asleep or never responded. No action was being taken. So I decided to figure out what was happening, and I learned all these great tricks for fixing it.

What I learned in my journey is that effective data visualization communicates a story quickly, clearly, accurately, and ethically, and it had really four main goals — to inform decisions, to inspire action, to galvanize people, and most importantly to communicate the value of the work that you do.

Now, there are lots of things you can do, but I was struggling to find one specific process that was going to help me get from what I was trying to communicate to getting people to act on it. So I developed my own methodology. It’s called the PICA Protocol, and it’s a visualization prescription for impactful data storytelling. What I like about this protocol is that it’s practical, approachable. It’s not complicated. It’s prescriptive, and it’s repeatable. I believe it’s going to get you where you need to go every time.

So let’s say one of your managers, clients, stakeholders is asking you for something like, “What are our most successful keyword groups?” Something delightfully vague like that. Now, before you jump into your data visualization platform and start dropping charts like it’s hot, I want you to take a step back and start with the first step in the process, which is P for purpose.

P for Purpose

So I found that every great data visualization started with a very focused question or questions.

  • Why do you exist? Get philosophical with it.
  • What need of my audience are you meeting?
  • What decisions are you going to inform?

These questions help you get really focused about what you’re going to present and avoid the sort of needle in a haystack approach to seeing what might stick.

So the answers to these questions are going to help you make an important decision, to choose an appropriate chart type for the message that you’re trying to convey. Some of the ways you want to do that — I hear you guys are like into keywords a little bit — you want to listen for the keywords of what people are asking you for. So in this case, we have “most successful.” Okay, that indicates a comparison. Different types or campaigns or groups, those are categories. So it sounds like what we’re going for is a categorical comparison. There are other kinds of keywords you can look for, like changing over time, how this affects that. Answers or opinions. All of those are going to help you determine your most appropriate visual.

Now, in this case, we have a categorical comparison, so I always go back to basics. It’s an oldie but goodie, but we’re going to do the tried-and-true bar chart. It’s universally understood and doesn’t have a learning curve. What I would not recommend are pie charts. No, no, no. Unless you only have two segments in your visual and one is unmistakably larger than the other, pie charts are not your best choice for communicating categorical comparison, composition, or ranking.

I for Insight

So we have our choice. We’re now going to move on to the next step in the methodology, which is I for insight. So an insight is something that gives a person a capacity to understand something quickly, accurately, and intuitively. Think of those criteria.

So here, does my display surface the story and answer these questions intuitively? That’s our criteria. The components of that are:

  • Layout and orientation. So how is the chart configured? Very often we’ll use vertical bar charts for categorical comparison, but that will end up having diagonal labels if they’re really long, and unless your audience walks around like this all the time, it’s going to be confusing because that would be weird. So you want to make sure it’s oriented well.
  • Labeling. In the case of bars, I always prefer to label each bar directly rather than relying on just an axis, because then their eyes aren’t jumping from bar to axis to bar to axis and they’re paying more attention to you. That’s also for line charts. Very often I’ll label a line with a maximum, a minimum, and maybe the most important data point.
  • Interpretation of the data and where we’re placing it, the location.
    • So our interpretation, is it objective or is it subjective? So subjective words are like better or worse or stupid or awesome. Those are opinions. But objective words are higher, lower, most efficient, least efficient. So you really want your observations to be objective.
    • Have you presented it ethically? Or have you manipulated the view in a way that isn’t telling a really ethical picture, like adjusting a bar axis above zero, which is a no-no? But you can do that with a line graph in certain cases. So look for those nuances. You want to basically ask yourself, “Would I be able to uphold this visual in a court of law or sleep at night?”
    • Location of that insight. So very often we’ll put our insights, our interpretation down here or in really tiny letters up here. Then up here we’ll put big letters saying this is sales, my keyword category. No. What we want to do is we want to put our interpretation up here. This top area is the most important real estate on your visual. That’s where their eyes are going to look first. So think of this like a BuzzFeed headline for your visual. What do you want them to take away? You can always put what the chart is here in a little subtitle.
  • Make recommendations. Because that’s what a really powerful visual is going to do.
    • I always suggest having two recommendations at least, because this way you’re empowering your audience with a choice. This way you can actually be subjective. That is okay in this case, because that’s your unique subject matter expertise.
    • Are your recommendations accountable to specific people? Are they feasible?
    • What’s the cost of not acting on your recommendations? Put some urgency behind it. So I like to put my recommendations in a little box or callout on the side here so it’s really clear after I’ve presented my facts.

C for Context

The next step in the methodology is C for context. What this is saying is, “Do I have all the data points I need to paint a complete picture, or is there more to this story?” So some additional lenses you might find useful are past period comparisons, targets or benchmarks, and segmentation by things like geography or mobile device. Or what are the typical questions or arguments that your audience has when you present data? Those can be super valuable contextual points.

In this case, I might decide that while they care about the number of sales, because that’s most successful to them, I care about the keywords “conversion rates.” So I’m going to add a second bar chart here like this, and I’m going to see there’s a different story that’s popping out here now.

Now, this is where your data storytelling really comes into play. This particular strategy is called a table lens or a side-by-side bar chart. It’s what I recommend if you want to combine two categorical metrics together.

A for Aesthetics

Now, the last step in the methodology is A for aesthetics. Aesthetics are how things look. So it’s not about making it look pretty. No, it’s asking, “Does my viz comply with brain best practices of how we absorb information?”

1. Decrease visual noise

So the first step in doing that is we want to decrease visual noise, because that creates a lot of tension. So decreasing noise will increase the chance of a happy brain.

Now, I’m a crunchy granola hippie, so I love to detox every day. I’ve developed a data visualization detox that entails removing things like grid lines, borders, axis lines, line markers, and backgrounds. Get all of that junk out of there, really clean up. You can align everything to the left to make sure that the brain is following things properly down. Don’t center everything.

2. Use uniform colors (plus one standout color for emphasis)

Now, you’ll notice that most of my bars here have a uniform color — simple black. I like to color everything one color, because then I’ll use a separate, standout color, like this blue, to strategically emphasize my key message. You might notice that I did that throughout this step for the words that I want you to pick out. That’s why I colored these particular bars, because this feels like the story to me, because that is the storytelling part of this message.

Notice that I also colored the category in my observation to create a connective tissue between these two items. So using color intentionally means things like using green for good and red for bad, not arbitrarily, and then maybe blue for what’s important.

3. Source your data

Then finally, you always want to source your data. That increases the trust. So you want to put your platform and your date range. Really simple.

So this is the anatomy of an awesome data viz. I’ve adapted it from a great book called “Good Charts” by my friend, Scott Berinato. What I have found is that by using this protocol, you’re going to end up with these wonderful, raving fans who are going to love your work and understand your value. I included a little kitty fan because I can. It’s my Whiteboard Friday.

So that is the protocol. I actually have included a free gift for you today. If you click the link at the end of this post, you’ll be able to sign up for a Chart Detox Checklist, a full printable PICA Protocol prescription and a Chart Choosing Guide.

Get the PICA Protocol prescription

I would actually love to hear from you. What are the kinds of struggles that you have in presenting your insights to stakeholders, where you just feel like they’re not getting the value of what you’re doing? I’d love to hear any questions you have about the methodology as well.

So thank you for watching this edition of Whiteboard Friday. I hope you enjoyed it. We’ll see you next week, and please remember to viz responsibly, my friends. Namaste.

Video transcription by Speechpad.com

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Moz Blog



What Google’s GDPR Compliance Efforts Mean for Your Data: Two Urgent Actions

Posted by willcritchlow

It should be quite obvious to anyone who knows me that I’m not a lawyer, and therefore that what follows is not legal advice. For anyone who doesn’t know me: I’m not a lawyer, I’m certainly not your lawyer, and what follows is definitely not legal advice.

With that out of the way, I wanted to give you some bits of information that might feed into your GDPR planning, as they come up more from the marketing side than the pure legal interpretation of your obligations and responsibilities under this new legislation. While most legal departments will be considering the direct impacts of the GDPR on their own operations, many might miss the impacts that other companies’ (namely, in this case, Google’s) compliance actions have on your data.

But I might be getting a bit ahead of myself: it’s quite possible that not all of you know what the GDPR is, and why or whether you should care. If you do know what it is, and you just want to get to my opinions, go ahead and skip down the page.

What is the GDPR?

The tweet-length version is that the GDPR (General Data Protection Regulation) is new EU legislation covering data protection and privacy for EU citizens, and it applies to all companies offering goods or services to people in the EU.

Even if you aren’t based in the EU, it applies to your company if you have customers who are, and it has teeth (fines of up to the greater of 4% of global revenue or EUR20m). It comes into force on May 25. You have probably heard about it through the myriad organizations who put you on their email list without asking and are now emailing you to “opt back in.”

In most companies, it will not fall to the marketing team to research everything that has to change and to achieve compliance, though it is worth getting up to speed with at least the high-level outline of the regulation, and in particular its requirements around informed consent, which it defines as:

“…any freely given, specific, informed, and unambiguous indication of the data subject’s wishes by which he or she, by a statement or by a clear affirmative action, signifies agreement to the processing of personal data relating to him or her.”

As always, when laws are made about new technology, there are many questions to be resolved, and indeed, jokes to be made.

But my post today isn’t about what you should do to get compliant — that’s specific to your circumstances — and a ton has been written about this already.

My intention is not to write a general guide, but rather to warn you about two specific things you should be doing with analytics (Google Analytics in particular) as a result of changes Google is making because of GDPR.

Unexpected consequences of GDPR

When you deal directly with a person in the EU, and they give you personally identifiable information (PII) about themselves, you are typically in what is called the “data controller” role. The GDPR also identifies another role, which it calls “data processor,” which is any other company your company uses as a supplier and which handles that PII. When you use a product like Google Analytics on your website, Google is taking the role of data processor. While most of the restrictions of the GDPR apply to you as the controller, the processor must also comply, and it’s here that we see some potentially unintended (but possibly predictable) consequences of the legislation.

Google is unsurprisingly seeking to minimize their risk (I say it’s unsurprising because those GDPR fines could be as large as $4.4 billion, based on last year’s revenue, if they get it wrong). They are doing this firstly by pushing as much of the obligation as possible onto you (the data controller), and secondly by going further by default than the GDPR requires, shutting down accounts that infringe their terms more aggressively than the regulation demands (regardless of whether the infringement also infringes the GDPR).

This is entirely rational — with GA being in most cases a product offered for free, and the value coming to Google entirely in the aggregate, it makes perfect sense to limit their risks in ways that don’t degrade their value, and to just kick risky setups off the platform rather than taking on extreme financial risk for individual free accounts.

It’s not only Google, by the way. There are other suppliers doing similar things which will no doubt require similar actions, but I am focusing on Google here simply because GA is pervasive throughout the web marketing world. Some companies are even going as far as shutting down entirely for EU citizens (like unroll.me). See this Twitter thread of others.

Consequence 1: Default data retention settings for GA will delete your data

Starting on May 25, Google will be changing the default for data retention, meaning that if you don’t take action, certain data older than the cutoff will be automatically deleted.

You can read more about the details of the change on Krista Seiden’s personal blog (Krista works at Google, but this post is written in her personal capacity).

The reason I say this isn’t strictly a GDPR question for you is that it stems from changes Google is making on their end to ensure that they comply with their obligations as a data processor. It gives you tools you might need, but it isn’t strictly related to your own GDPR compliance. There is no single “right” answer to how long you need to, should, or are allowed to keep this data stored in GA under the GDPR, but by my reading, given that it shouldn’t be PII anyway (see below), it isn’t really a GDPR question for most organizations. In particular, there is no reason to think that Google’s default is the correct, mandated, or only setting you can choose under the GDPR.

Action: Review the promises being made by your legal team and your new privacy policy to understand the correct retention setting for your org. In the absence of explicit promises to your users, my understanding is that you can retain any of this data you were allowed to capture in the first place, unless you receive a deletion request against it. So while most orgs will have at least some changes to make to their privacy policies, most GA users can switch the setting back to retain this data indefinitely.

Consequence 2: Google is deleting GA accounts for capturing PII

It has long been against the Terms of Service to store any personally identifiable information (PII) in Google Analytics. Recently, though, it appears that Google has become far more diligent in checking for the presence of PII and robust in their handling of accounts found to contain any. Put more simply, Google will delete your account if they find PII.

It’s impossible to know for sure that this is GDPR-related, but being able to demonstrate to regulators, if necessary, that they take strict action against anyone violating their PII-related terms is an obvious move for Google to reduce the risk they face as a data processor. It makes particular sense in an area where the vast majority of accounts are free accounts. Much like the previous point, the reason I say this is related to Google’s response to the GDPR (rather than to your own obligations) is that it would be perfectly possible to get your users’ permission to record their data in third-party services like GA and fully comply with the regulations. Regardless of the permissions your users give you, though, Google’s crackdown (and heavier enforcement of terms that have been in place for some time) means that storing PII in GA is a greater risk than it was before.

Action: Audit your GA profile and implementation for PII risks:

  • There are various ways you can search within GA itself to find data that could be personally identifying in places like page titles, URLs, custom data, etc. (see these two excellent guides). A minimal sketch of this kind of check appears after this list.
  • You can also audit your implementation by reviewing rules in tag manager and/or reviewing the code present on key pages. The most likely suspects are the places where people log in, take key actions on your site, give you additional personal information, or check out.
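
As one illustration of the first bullet, here is a minimal PII-scan sketch, assuming you have exported page paths from GA (for example, from the All Pages report) to a CSV. The file name, column name, and patterns are hypothetical; adapt them to your own export.

    # A minimal PII-audit sketch: scan exported GA page paths for email- or phone-like strings.
    # The file name, column name, and regexes are illustrative only.
    import csv
    import re

    EMAIL = re.compile(r"[A-Za-z0-9._%+-]+(@|%40)[A-Za-z0-9.-]+\.[A-Za-z]{2,}")  # %40 = URL-encoded @
    PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def find_pii(rows, field="Page"):
        """Yield values that look like they contain an email address or phone number."""
        for row in rows:
            value = row.get(field, "")
            if EMAIL.search(value) or PHONE.search(value):
                yield value

    with open("ga_all_pages_export.csv", newline="", encoding="utf-8") as f:
        for hit in find_pii(csv.DictReader(f)):
            print("Possible PII in:", hit)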

Don’t take your EU law advice from big US tech companies

The internal effort and coordination required at Google to do their bit to comply, even “just” as a data processor, is significant. Unfortunately, there are strong arguments that this kind of ostensibly user-friendly regulation, which imposes outsize compliance burdens on smaller companies, will cement the duopoly and dominance of Google and Facebook and enable them to pass the costs and burdens of compliance onto sectors that are already struggling.

Regardless of the intended or unintended consequences of the regulation, it seems clear to me that we shouldn’t be basing our own businesses’ (and our clients’) compliance on self-interested advice and actions from the tech giants. No matter how impressive their own compliance, I’ve been hugely underwhelmed by guidance content they’ve put out. See, for example, Google’s GDPR “checklist” — not exactly what I’d hope for:

Client Checklist: As a marketer we know you need to select products that are compliant and use personal data in ways that are compliant. We are committed to complying with the GDPR and would encourage you to check in on compliance plans within your own organisation. Key areas to think about:

  • How does your organisation ensure user transparency and control around data use?
  • Do you explain to your users the types of data you collect and for what purposes?
  • Are you sure that your organisation has the right consents in place where these are needed under the GDPR?
  • Do you have all of the relevant consents across your ad supply chain?
  • Does your organisation have the right systems to record user preferences and consents?
  • How will you show to regulators and partners that you meet the principles of the GDPR and are an accountable organisation?

So, while I’m not a lawyer, definitely not your lawyer, and this is not legal advice, if you haven’t already received any advice, I can say that you probably can’t just follow Google’s checklist to get compliant. But you should, as outlined above, take the specific actions you need to take to protect yourself and your business from their compliance activities.



Moz Blog


Yext adds TripAdvisor to listings network, conversational UI for local data updates

Messaging-based updates especially useful for SMBs and local store managers.
Please visit Search Engine Land for the full article.





Search Engine Land: News & Info About SEO, PPC, SEM, Search Engines & Search Marketing


Moz’s Link Data Used to Suck… But Not Anymore! The New Link Explorer is Here – Whiteboard Friday

Posted by randfish

Earlier this week we launched our brand-new link building tool, and we’re happy to say that Link Explorer addresses and improves upon a lot of the big problems that have plagued our legacy link tool, Open Site Explorer. In today’s Whiteboard Friday, Rand transparently lists out many of the biggest complaints we’ve heard about OSE over the years and explains the vast improvements Link Explorer provides, from DA scores updated daily to historic link data to a huge index of almost five trillion URLs.

Moz's Link Data Used to Suck... But Not Anymore! The New Link Explorer is Here - Whiteboard Friday

Click on the whiteboard image above to open a high-resolution version in a new tab!


Video Transcription

Howdy, Moz fans, and welcome to another edition of Whiteboard Friday. This week I’m very excited to say that Moz’s Open Site Explorer product, which had a lot of challenges with it, is finally being retired, and we have a new product, Link Explorer, that’s taking its place. So let me walk you through why and how Moz’s link data for the last few years has really kind of sucked. There’s no two ways about it.

If you heard me here on Whiteboard Friday, if you watched me at conferences, if you saw me blogging, you’d probably see me saying, “Hey, I personally use Ahrefs, or I use Majestic for my link research.” Moz has a lot of other good tools. The crawler is excellent. Moz Pro is good. But Open Site Explorer was really lagging, and today, that’s not the case. Let me walk you through this.

The big complaints about OSE/Mozscape

1. The index was just too small

Mozscape was probably about a fifth to a tenth the size of its competitors. While it got a lot of the good-quality links on the web, it just didn’t get enough. As SEOs, we need to know all of the links, the good ones and the bad ones.

2. The data was just too old

So, in Mozscape, a link that you built on November 1st, you got a link added to a website, you’re very proud of yourself. That’s excellent. You should expect that a link tool should pick that up within maybe a couple weeks, maybe three weeks at the outside. Google is probably picking it up within just a few days, sometimes hours.

Yet, when Mozscape would crawl that, it would often be a month or more later, and by the time Mozscape processed its index, it could be another 40 days after that, meaning that you could see a 60- to 80-day delay, sometimes even longer, between when your link was built and when Mozscape actually found it. That sucks.

3. PA/DA scores took forever to update

PA/DA scores, likewise, took forever to update because of this link problem. So the index would say, oh, your DA is over here. You’re at 25, and now maybe you’re at 30. But in reality, you’re probably far ahead of that, because you’ve been building a lot of links that Mozscape just hasn’t picked up yet. So this is this lagging indicator. Sometimes there would be links that it just didn’t even know about. So PA and DA just wouldn’t be as accurate or precise as you’d want them to be.

4. Some scores were really confusing and out of date

MozRank and MozTrust relied on essentially the original Google PageRank paper from 1997, and there’s no way that’s what’s being used today. Google certainly uses some view of link equity that’s passed between links and is similar to PageRank, and I think they probably still call it PageRank internally, but it looks nothing like what MozRank was.

Likewise, MozTrust was way out of date, based on a paper from, I think, 2002 or 2003. Many more advancements in search have happened since then.

Spam score was also out of date. It used a system that was correlated with what spam looked like three, four years ago, so much more up to date than these two, but really not nearly as sophisticated as what Google is doing today. So we needed to toss those out and find their replacements as well.

5. There was no way to see links gained and lost over time

Mozscape had no way to see gained and lost links over time, and folks thought, “Gosh, these other tools in the SEO space give me this ability to show me links that their index has discovered or links they’ve seen that we’ve lost. I really want that.”

6. DA didn’t correlate as well as it should have

So over time, DA became a less and less indicative measure of how well you were performing in Google’s rankings. That needed to change as well. The new DA, by the way, much, much better on this front.

7. Bulk metrics checking and link reporting was too hard and manual

So folks would say, “Hey, I have this giant spreadsheet with all my link data. I want to upload that. I want you guys to crawl it. I want to go fetch all your metrics. I want to get DA scores for these hundreds or thousands of websites that I’ve got. How do I do that?” We didn’t provide a good way for you to do that either unless you were willing to write code and loop in our API.

8. People wanted distribution of their links by DA

They wanted distributions of their links by domain authority. Show me where my links come from, yes, but also what sorts of buckets of DA do I have versus my competition? That was also missing.

So, let me show you what the new Link Explorer has.

Moz's new Link Explorer

Click on the whiteboard image above to open a high-resolution version in a new tab!

Wow, look at that magical board change, and it only took a fraction of a second. Amazing.

What Link Explorer has done, as compared to the old Open Site Explorer, is pretty exciting. I’m actually very proud of the team. If you know me, you know I am a picky SOB. I usually don’t even like most of the stuff that we put out here, but oh my god, this is quite an incredible product.

1. Link Explorer has a GIANT index

So I mentioned index size was a big problem. Link Explorer has a giant index. Frankly, it’s about 20 times larger than what Open Site Explorer had and, as you can see, very, very competitive with the other services out there. Majestic Fresh says they have about a trillion URLs from, I think, the last 60 days. Ahrefs, about 3 trillion. Majestic’s historic index, which goes back over all time, has about 7 trillion. And Moz, just in the last 90 days (which I think is our index window, or maybe a little shorter, 60 days), has 4.7 trillion, so almost 5 trillion URLs. Just really, really big. It covers a huge swath of the web, which is great.

2. All data updates every 24 hours

So, unlike the old index, it is very fresh. Every time it finds a new link, it updates PA and DA scores, and every morning the interface can show you all the links it found just yesterday.

3. DA and PA are tracked daily for every site

You don’t have to track them yourself. You don’t have to put them into your campaigns. Every time you go and visit a domain, you will see this graph showing you domain authority over time, which has been awesome.

For my new company, I’ve been tracking all the links that come in to SparkToro, and I can see my DA rising. It’s really exciting. I put out a good blog post, I get a bunch of links, and my DA goes up the next day. How cool is that?

4. Old scores are gone, and new scores are polished and high quality

So we got rid of MozRank and MozTrust, which were very old metrics and, frankly, very few people were using them, and most folks who were using them didn’t really know how to use them. PA basically takes care of both of them. It includes the weight of links that come to you and the trustworthiness. So that makes more sense as a metric.

Spam score is now on a 0 to 100% risk model, instead of the old 0 to 17 flags where the flags correlated to some percentage. Spam score is basically a machine learning model built against sites that Google penalized or banned.

So we took a huge number of domains and ran their names through Google. If they couldn’t rank for their own name, we said they were penalized. If we did a site:domain.com search and Google had de-indexed them, we said they were banned. Then we built this risk model. So a 90% spam score means that 90% of sites with these qualities were penalized or banned; 2% means only 2% were. If you have a 30% spam score, that’s not too bad. If you have a 75% spam score, it’s getting a little sketchy.
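
To make that labeling rule a bit more concrete, here is a rough sketch of the heuristic in Python. This is emphatically not Moz’s actual implementation; it assumes you have already collected, for each domain, the top results for a search on the domain’s own name and for a site: query, by whatever means.

    # A rough sketch of the labeling heuristic described above, NOT Moz's implementation.
    # It assumes you've already gathered SERP results for each domain by some other means.

    def label_domain(domain, name_query_results, site_query_results):
        """Label a domain 'banned', 'penalized', or 'ok' using the rule from the transcript."""
        if not site_query_results:
            # Nothing indexed for site:domain.com, so treat the domain as banned/de-indexed.
            return "banned"
        if not any(domain in url for url in name_query_results):
            # The domain can't rank for its own name, so treat it as penalized.
            return "penalized"
        return "ok"

    # Hypothetical example: the domain is indexed but doesn't rank for its own name.
    print(label_domain("example.com",
                       name_query_results=["https://example.org/", "https://example.net/"],
                       site_query_results=["https://example.com/page"]))  # -> "penalized"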

5. Discovered and lost links are available for every site, every day

So again, for this new startup that I’m doing, I’ve been watching as I get new links and I see where they come from, and then sometimes I’ll reach out on Twitter and say thank you to those folks who are linking to my blog posts and stuff. But it’s very, very cool to see links that I gain and links that I lose every single day. This is a feature that Ahrefs and Majestic have had for a long time, and frankly Moz was behind on this. So I’m very glad that we have it now.

6. DA is back as a high-quality leading indicator of ranking ability

So, a note that is important: everyone’s DA has changed. Your DA has changed. My DA has changed. Moz’s DA changed. Google’s DA changed. I think it went from a 98 to a 97. My advice is take a look at yourself versus all your competitors that you’re trying to rank against and use that to benchmark yourself. The old DA was an old model on old data on an old, tiny index. The new one is based on this 4.7 trillion size index. It is much bigger. It is much fresher. It is much more accurate. You can see that in the correlations.

7. Building link lists, tracking links that you want to acquire, and bulk metrics checking is now easy

Building link lists, tracking links that you want to acquire, and bulk metrics checking (which we never had before and, in fact, not a lot of the other tools have) are now available through possibly my favorite feature in the tool, called Link Tracking Lists. If you’ve used Keyword Explorer and set up keywords to watch over time as a keyword research set, it’s very, very similar. If you have links you want to acquire, you add them to this list. If you have links that you want to check on, you add them to this list. It will give you all the metrics, and it will tell you: does this page link to the website you’ve associated with the list, or does it not? Or does it link to some page on the domain, but maybe not exactly the page that you want? It will tell you that too. Pretty cool.

8. Link distribution by DA

Finally, we do now have link distribution by DA. You can find that right on the Overview page at the bottom.

Look, I’m not saying Link Explorer is the absolute perfect, best product out there, but it’s really, really damn good. I’m incredibly proud of the team. I’m very proud to have this product out there.

If you’d like, I’ll be writing some more about how we went about building this product and about the agency folks we spent time with while developing it. I would like to thank all of them, of course. A huge thank you to the Moz team.

I hope you’ll do me a favor. Check out Link Explorer. I think, very frankly, this team has earned 30 seconds of your time to go check it out.

Try out Link Explorer!

All right. Thanks, everyone. We’ll see you again for another edition of Whiteboard Friday. Take care.

Video transcription by Speechpad.com



Moz Blog


SearchCap: Bing Shopping Ads, Google Shopping carousels & Facebook data check

Below is what happened in search today, as reported on Search Engine Land and from other places across the web.

The post SearchCap: Bing Shopping Ads, Google Shopping carousels & Facebook data check appeared first on Search Engine Land.



Please visit Search Engine Land for the full article.


Search Engine Land: News & Info About SEO, PPC, SEM, Search Engines & Search Marketing


Google Confirms Chrome Usage Data Used to Measure Site Speed

Posted by Tom-Anthony

During a discussion with Google’s John Mueller at SMX Munich in March, he told me an interesting bit of data about how Google evaluates site speed nowadays. It has gotten a bit of interest from people when I mentioned it at SearchLove San Diego the week after, so I followed up with John to clarify my understanding.

The short version is that Google is now using performance data aggregated from Chrome users who have opted in as a datapoint in the evaluation of site speed (and as a signal with regards to rankings). This is a positive move (IMHO) as it means we don’t need to treat optimizing site speed for Google as a separate task from optimizing for users.

Previously, it has not been clear how Google evaluates site speed, and it was generally believed to be measured by Googlebot during its visits — a belief enhanced by the presence of speed charts in Search Console. However, the onset of JavaScript-enabled crawling made it less clear what Google is doing — they obviously want the most realistic data possible, but it’s a hard problem to solve. Googlebot is not built to replicate how actual visitors experience a site, and so as the task of crawling became more complex, it makes sense that Googlebot may not be the best mechanism for this (if it ever was the mechanism).

In this post, I want to recap the pertinent data around this news quickly and try to understand what this may mean for users.

Google Search Console

Firstly, we should clarify our understanding of what the “time spent downloading a page” metric in Google Search Console is telling us. Most of us will recognize graphs like this one:

Until recently, I was unclear about exactly what this graph was telling me. But handily, John Mueller comes to the rescue again with a detailed answer [login required] (hat tip to James Baddiley from Chillisauce.com for bringing this to my attention):

John clarified what this graph is showing:

It’s technically not “downloading the page” but rather “receiving data in response to requesting a URL” – it’s not based on rendering the page, it includes all requests made.

And that it is:

this is the average over all requests for that day

Because Google may be fetching a very different set of resources every day when it’s crawling your site, and because this graph does not account for anything to do with page rendering, it is not useful as a measure of the real performance of your site.

For that reason, John points out that:

Focusing blindly on that number doesn’t make sense.

With which I quite agree. The graph can be useful for identifying certain classes of backend issues, but there are also probably better ways for you to do that (e.g. WebPageTest.org, of which I’m a big fan).

Okay, so now that we understand that graph and what it represents, let’s look at the next option: the Google WRS.

Googlebot & the Web Rendering Service

Google’s WRS is their headless browser mechanism based on Chrome 41, which is used for things like “Fetch as Googlebot” in Search Console, and is increasingly what Googlebot is using when it crawls pages.

However, we know that this isn’t how Google evaluates pages because of a Twitter conversation between Aymen Loukil and Google’s Gary Illyes. Aymen wrote up a blog post detailing it at the time, but the important takeaway was that Gary confirmed that WRS is not responsible for evaluating site speed:

Twitter conversation with Gary Illyes

At the time, Gary was unable to clarify what was being used to evaluate site performance (perhaps because the Chrome User Experience Report hadn’t been announced yet). It seems as though things have progressed since then, however. Google is now able to tell us a little more, which takes us on to the Chrome User Experience Report.

Chrome User Experience Report

Introduced in October last year, the Chrome User Experience Report “is a public dataset of key user experience metrics for top origins on the web,” whereby “performance data included in the report is from real-world conditions, aggregated from Chrome users who have opted-in to syncing their browsing history and have usage statistic reporting enabled.”

Essentially, certain Chrome users allow their browser to report back load time metrics to Google. The report currently has a public dataset for the top 1 million+ origins, though I imagine they have data for many more domains than are included in the public data set.

In March I was at SMX Munich (amazing conference!), where along with a small group of SEOs I had a chat with John Mueller. I asked John about how Google evaluates site speed, given that Gary had clarified it was not the WRS. John was kind enough to shed some light on the situation, but at that point, nothing was published anywhere.

However, since then, John has confirmed this information in a Google Webmaster Central Hangout [15m30s, in German], where he explains they’re using this data along with some other data sources (he doesn’t say which, though notes that it is in part because the data set does not cover all domains).

At SMX, John also pointed out that Google’s PageSpeed Insights tool now includes data from the Chrome User Experience Report.

The public dataset of performance data for the top million domains is also available in a public BigQuery project, if you’re into that sort of thing!
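
If you do want to poke at the BigQuery dataset, here is a minimal sketch using the google-cloud-bigquery Python client. The table name (the 201710 release) and field names reflect the dataset as it launched, so check the current schema and releases before relying on this; example.com is just a stand-in for an origin you care about.

    # A minimal sketch of querying the public Chrome User Experience Report dataset in BigQuery.
    # Table and field names reflect the dataset's launch release (201710); verify the current schema.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes your Google Cloud credentials are already configured

    sql = """
        SELECT bin.start, SUM(bin.density) AS density
        FROM `chrome-ux-report.all.201710`,
             UNNEST(first_contentful_paint.histogram.bin) AS bin
        WHERE origin = 'http://example.com'
        GROUP BY bin.start
        ORDER BY bin.start
    """

    # Each row is one first-contentful-paint histogram bucket for the origin, with the share
    # of opted-in Chrome page loads that fell into that bucket.
    for row in client.query(sql).result():
        print(row["start"], row["density"])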

We can’t be sure what all the other factors Google is using are, but we now know they are certainly using this data. As I mentioned above, I also imagine they are using data on more sites than are perhaps provided in the public dataset, but this is not confirmed.

Pay attention to users

Importantly, this means that there are changes you can make to your site that Googlebot is not capable of detecting, which are still detected by Google and used as a ranking signal. For example, we know that Googlebot does not support HTTP/2 crawling, but now we know that Google will be able to detect the speed improvements you would get from deploying HTTP/2 for your users.

The same is true if you were to use service workers for advanced caching behaviors — Googlebot wouldn’t be aware, but users would. There are certainly other such examples.

Essentially, this means that there’s no longer a reason to worry about pagespeed for Googlebot, and you should instead just focus on improving things for your users. You still need to pay attention to Googlebot for crawling purposes, which is a separate task.

If you are unsure where to look for site speed advice, tools like WebPageTest.org and Google’s PageSpeed Insights (both mentioned above) are good places to start.

That’s all for now! If you have questions, please comment here and I’ll do my best! Thanks!



Moz Blog


Microsoft adds Reddit data to Bing search results, Power BI analytics tool

Reddit posts will appear in Bing’s search results, and its data will be piped into Power BI for marketers to track brand-related comments.

The post Microsoft adds Reddit data to Bing search results, Power BI analytics tool appeared first on Search Engine Land.



Please visit Search Engine Land for the full article.


Search Engine Land: News & Info About SEO, PPC, SEM, Search Engines & Search Marketing


Don’t Be Fooled by Data: 4 Data Analysis Pitfalls & How to Avoid Them

Posted by Tom.Capper

Digital marketing is a proudly data-driven field. Yet, as SEOs especially, we often have such incomplete or questionable data to work with, that we end up jumping to the wrong conclusions in our attempts to substantiate our arguments or quantify our issues and opportunities.

In this post, I’m going to outline 4 data analysis pitfalls that are endemic in our industry, and how to avoid them.

1. Jumping to conclusions

Earlier this year, I conducted a ranking factor study around brand awareness, and I posted this caveat:

“…the fact that Domain Authority (or branded search volume, or anything else) is positively correlated with rankings could indicate that any or all of the following is likely:

  • Links cause sites to rank well
  • Ranking well causes sites to get links
  • Some third factor (e.g. reputation or age of site) causes sites to get both links and rankings”
    ~ Me

However, I want to go into this in a bit more depth and give you a framework for analyzing these yourself, because it still comes up a lot. Take, for example, this recent study by Stone Temple, which you may have seen in the Moz Top 10 or Rand’s tweets, or this excellent article discussing SEMRush’s recent direct traffic findings. To be absolutely clear, I’m not criticizing either of the studies, but I do want to draw attention to how we might interpret them.

Firstly, we do tend to suffer a little confirmation bias — we’re all too eager to call out the cliché “correlation vs. causation” distinction when we see successful sites that are keyword-stuffed, but all too approving when we see studies doing the same with something we think is or was effective, like links.

Secondly, we fail to critically analyze the potential mechanisms. The options aren’t just causation or coincidence.

Before you jump to a conclusion based on a correlation, you’re obliged to consider various possibilities:

  • Complete coincidence
  • Reverse causation
  • Joint causation
  • Linearity
  • Broad applicability

If those don’t make any sense, then that’s fair enough — they’re jargon. Let’s go through an example:

Before I warn you not to eat cheese because you may die in your bedsheets, I’m obliged to check that it isn’t any of the following:

  • Complete coincidence - Is it possible that so many datasets were compared, that some were bound to be similar? Why, that’s exactly what Tyler Vigen did! Yes, this is possible.
  • Reverse causation - Is it possible that we have this the wrong way around? For example, perhaps your relatives, in mourning for your bedsheet-related death, eat cheese in large quantities to comfort themselves? This seems pretty unlikely, so let’s give it a pass. No, this is very unlikely.
  • Joint causation - Is it possible that some third factor is behind both of these? Maybe increasing affluence makes you healthier (so you don’t die of things like malnutrition), and also causes you to eat more cheese? This seems very plausible. Yes, this is possible.
  • Linearity - Are we comparing two linear trends? A linear trend is a steady rate of growth or decline. Any two statistics which are both roughly linear over time will be very well correlated. In the graph above, both our statistics are trending linearly upwards. If the graph were drawn with different scales, they might look completely unrelated, but because they both change at a steady rate, they’d still be very well correlated (there’s a short demonstration of this below). Yes, this looks likely.
  • Broad applicability - Is it possible that this relationship only exists in certain niche scenarios, or, at least, not in my niche scenario? Perhaps, for example, cheese does this to some people, and that’s been enough to create this correlation, because there are so few bedsheet-tangling fatalities otherwise? Yes, this seems possible.

So we have 4 “Yes” answers and one “No” answer from those 5 checks.

If your example doesn’t get 5 “No” answers from those 5 checks, it’s a fail, and you don’t get to say that the study has established either a ranking factor or a fatal side effect of cheese consumption.
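
The linearity trap in particular is easy to demonstrate for yourself. The sketch below generates two completely unrelated, invented series that both trend upward over time and then measures their correlation; it will usually come out very high.

    # Two unrelated series that both trend upward over time will correlate strongly,
    # even though neither has anything to do with the other. All numbers are invented.
    import numpy as np

    months = np.arange(36)
    cheese_consumption = 100 + 2.0 * months + np.random.normal(0, 3, size=36)
    organic_sessions = 5000 + 150 * months + np.random.normal(0, 400, size=36)

    r = np.corrcoef(cheese_consumption, organic_sessions)[0, 1]
    print(f"Correlation between two unrelated linear trends: r = {r:.2f}")  # typically > 0.9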

A similar process should apply to case studies, which are another form of correlation — the correlation between you making a change, and something good (or bad!) happening. For example, ask:

  • Have I ruled out other factors (e.g. external demand, seasonality, competitors making mistakes)?
  • Did I increase traffic by doing the thing I tried to do, or did I accidentally improve some other factor at the same time?
  • Did this work because of the unique circumstance of the particular client/project?

This is particularly challenging for SEOs, because we rarely have data of this quality, but I’d suggest an additional pair of questions to help you navigate this minefield:

  • If I were Google, would I do this?
  • If I were Google, could I do this?

Direct traffic as a ranking factor passes the “could” test, but only barely — Google could use data from Chrome, Android, or ISPs, but it’d be sketchy. It doesn’t really pass the “would” test, though — it’d be far easier for Google to use branded search traffic, which would answer the same questions you might try to answer by comparing direct traffic levels (e.g. how popular is this website?).

2. Missing the context

If I told you that my traffic was up 20% week on week today, what would you say? Congratulations?

What if it was up 20% this time last year?

What if I told you it had been up 20% year on year, up until recently?

It’s funny how a little context can completely change this. This is another problem with case studies and their evil inverted twin, traffic drop analyses.

If we really want to understand whether to be surprised at something, positively or negatively, we need to compare it to our expectations, and then figure out what deviation from our expectations is “normal.” If this is starting to sound like statistics, that’s because it is statistics — indeed, I wrote about a statistical approach to measuring change way back in 2015.

If you want to be lazy, though, a good rule of thumb is to zoom out, and add in those previous years. And if someone shows you data that is suspiciously zoomed in, you might want to take it with a pinch of salt.
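
If you’d rather quantify “normal” than eyeball it, here is a minimal sketch of the idea: compare this week’s change to how much your weekly traffic usually fluctuates, and to the same week last year. All the numbers are made up for illustration.

    # Putting a weekly traffic change in context. All numbers are invented for illustration.
    import statistics

    this_week = 12_600
    last_week = 10_500
    same_week_last_year = 11_900

    # Hypothetical history of week-on-week changes (as fractions) over the past ten weeks.
    historical_wow_changes = [0.04, -0.02, 0.07, 0.01, -0.05, 0.03, 0.02, -0.01, 0.06, -0.03]

    wow_change = (this_week - last_week) / last_week
    yoy_change = (this_week - same_week_last_year) / same_week_last_year

    mean = statistics.mean(historical_wow_changes)
    stdev = statistics.stdev(historical_wow_changes)
    z_score = (wow_change - mean) / stdev  # how unusual is this week, given normal fluctuation?

    print(f"Week on week: {wow_change:+.1%}, year on year: {yoy_change:+.1%}, z-score: {z_score:.1f}")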

3. Trusting our tools

Would you make a multi-million dollar business decision based on a number that your competitor could manipulate at will? Well, chances are you do, and the number can be found in Google Analytics. I’ve covered this extensively in other places, but there are some major problems with most analytics platforms around:

  • How easy they are to manipulate externally
  • How arbitrarily they group hits into sessions
  • How vulnerable they are to ad blockers
  • How they perform under sampling, and how obvious they make this

For example, did you know that the Google Analytics API v3 can heavily sample data whilst telling you that the data is unsampled, above a certain amount of traffic (~500,000 within date range)? Neither did I, until we ran into it whilst building Distilled ODN.

Similar problems exist with many “Search Analytics” tools. My colleague Sam Nemzer has written a bunch about this — did you know that most rank tracking platforms report completely different rankings? Or how about the fact that the keywords grouped by Google (and thus tools like SEMRush and STAT, too) are not equivalent, and don’t necessarily have the volumes quoted?

It’s important to understand the strengths and weaknesses of tools that we use, so that we can at least know when they’re directionally accurate (as in, their insights guide you in the right direction), even if not perfectly accurate. All I can really recommend here is that skilling up in SEO (or any other digital channel) necessarily means understanding the mechanics behind your measurement platforms — which is why all new starts at Distilled end up learning how to do analytics audits.

One of the most common solutions to the root problem is combining multiple data sources, but…

4. Combining data sources

There are numerous platforms out there that will “defeat (not provided)” by bringing together data from two or more of:

  • Analytics
  • Search Console
  • AdWords
  • Rank tracking

The problems here are that, firstly, these platforms do not have equivalent definitions, and secondly, ironically, (not provided) tends to break them.

Let’s deal with definitions first, with an example — let’s look at traffic to a landing page from a given channel:

  • In Search Console, these are reported as clicks, and can be vulnerable to heavy, invisible sampling when multiple dimensions (e.g. keyword and page) or filters are combined.
  • In Google Analytics, these are reported using last non-direct click, meaning that your organic traffic includes a bunch of direct sessions, time-outs that resumed mid-session, etc. That’s without getting into dark traffic, ad blockers, etc.
  • In AdWords, most reporting uses last AdWords click, and conversions may be defined differently. In addition, keyword volumes are bundled, as referenced above.
  • Rank tracking is location specific, and inconsistent, as referenced above.

Fine, though — it may not be precise, but you can at least get to some directionally useful data given these limitations. However, about that “(not provided)”…

Most of your landing pages get traffic from more than one keyword. It’s very likely that some of these keywords convert better than others, particularly if they are branded, meaning that even the most thorough click-through rate model isn’t going to help you. So how do you know which keywords are valuable?

The best answer is to generalize from AdWords data for those keywords, but it’s very unlikely that you have analytics data for all those combinations of keyword and landing page. Essentially, the tools that report on this make the very bold assumption that a given page converts identically for all keywords. Some are more transparent about this than others.

Again, this isn’t to say that those tools aren’t valuable — they just need to be understood carefully. The only way you could reliably fill in these blanks created by “not provided” would be to spend a ton on paid search to get decent volume, conversion rate, and bounce rate estimates for all your keywords, and even then, you’ve not fixed the inconsistent definitions issues.

Bonus peeve: Average rank

I still see this way too often. Three questions:

  1. Do you care more about losing rankings for ten very low volume queries (10 searches a month or less) than for one high volume query (millions plus)? If the answer isn’t “yes, I absolutely care more about the ten low-volume queries”, then this metric isn’t for you, and you should consider a visibility metric based on click through rate estimates (a rough sketch of one follows this list).
  2. When you start ranking at 100 for a keyword you didn’t rank for before, does this make you unhappy? If the answer isn’t “yes, I hate ranking for new keywords,” then this metric isn’t for you — because that will lower your average rank. You could of course treat all non-ranking keywords as position 100, as some tools allow, but is a drop of 2 average rank positions really the best way to express that 1/50 of your landing pages have been de-indexed? Again, use a visibility metric, please.
  3. Do you like comparing your performance with your competitors? If the answer isn’t “no, of course not,” then this metric isn’t for you — your competitors may have more or fewer branded keywords or long-tail rankings, and these will skew the comparison. Again, use a visibility metric.
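
For illustration, here is a rough sketch of a CTR-weighted visibility metric. The CTR-by-position curve and the keyword data are invented, not an industry standard; swap in whichever click-through model and rank-tracking export you trust.

    # A rough sketch of a CTR-weighted visibility metric, as an alternative to average rank.
    # The CTR-by-position curve and the keyword data below are invented for illustration.

    CTR_BY_POSITION = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05,
                       6: 0.04, 7: 0.03, 8: 0.025, 9: 0.02, 10: 0.015}

    def visibility(keywords):
        """Share of the clicks you'd capture if you ranked #1 for every tracked keyword."""
        captured = sum(kw["volume"] * CTR_BY_POSITION.get(kw["position"], 0.0) for kw in keywords)
        maximum = sum(kw["volume"] * CTR_BY_POSITION[1] for kw in keywords)
        return captured / maximum if maximum else 0.0

    # Hypothetical rank-tracking data: losing the high-volume ranking would hurt far more
    # than gaining rankings for a handful of low-volume queries, unlike with average rank.
    keywords = [
        {"query": "widgets", "volume": 1_000_000, "position": 3},
        {"query": "blue widgets", "volume": 400, "position": 1},
        {"query": "cheap widgets near me", "volume": 10, "position": 95},
    ]
    print(f"Visibility: {visibility(keywords):.1%}")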

Conclusion

Hopefully, you’ve found this useful. To summarize the main takeaways:

  • Critically analyse correlations & case studies by seeing if you can explain them as coincidences, as reverse causation, as joint causation, through reference to a third mutually relevant factor, or through niche applicability.
  • Don’t look at changes in traffic without looking at the context — what would you have forecasted for this period, and with what margin of error?
  • Remember that the tools we use have limitations, and do your research on how that impacts the numbers they show. “How has this number been produced?” is an important component in “What does this number mean?”
  • If you end up combining data from multiple tools, remember to work out the relationship between them — treat this information as directional rather than precise.

Let me know what data analysis fallacies bug you, in the comments below.



Moz Blog

