1 Introduction

With Google evaluating sites based on a wide range of ranking factors, knowing which factors to focus on in your SEO strategy for the biggest impact is crucial.

Several large-scale data studies, mainly conducted by SEO vendors, have sought to uncover the relevance and importance of certain ranking factors*. However, in our view, the studies contain major statistical flaws. For example, the use of correlation statistics as the main instrument may render results that are misleading in the presence of outliers or non-linear associations.

Considering these methodological issues, and the fact that several ranking factors have not been studied at all, there is a need for rock-solid data distilled into clear takeaways.

1.1 Methodology

  • Step 1 Ahrefs Raw Data: As a data partner, Ahrefs provided the raw data for the analysis. The data contained 1,183,680 keywords (1,183,628 after data cleaning; for details see below) with a total of 11,835,086 rankings (10,052,136 unique URLs; 10,052,028 after data cleaning).

  • Step 2 Data Mining: We developed a data-mining script to gather data on various variables. More specifically, we collected data on Schema.org Usage, Word Count, Title, H1, Broken Links and Page Size (HTML); a simplified sketch of this step is shown after this list. Due to anti-mining mechanisms, authoritative domains such as Amazon.com or youtube.com were not considered (see Section 1.2 for the number of observations we excluded for each domain). In the forthcoming sections, we refer to these as “Large Domains”. No data could be extracted for roughly 6% of the URLs due to server response errors. In total, we mined data from about 7,633,169 URLs.

  • Step 3 APIs and external data sources: In addition, the Alexa API was used to collect domain-level data on the Time-on-Site and Page Speed variables. Furthermore, Clearscope.io, another data partner, provided “content scores” for 1,000 high-search-volume keywords (see Section 2.3.3 for a detailed explanation).

  • Step 4 Data Analysis: The data was analysed and processed for selected features to show whether they exhibit a positive or negative trend across Google ranking positions. Polynomial regression was applied to all numeric variables. In some cases, linear regression was used instead (e.g. URL length) to provide simple average trends.
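
To make the data-mining step more concrete, below is a minimal sketch of the kind of page-level extraction such a script performs. It is illustrative only: the actual script, its selectors and its error handling are not published, and the use of requests and BeautifulSoup as well as the field names are our assumptions.

```python
# Illustrative sketch only - not the actual mining script.
# Assumes requests + BeautifulSoup; field names are hypothetical.
import requests
from bs4 import BeautifulSoup

def mine_url(url):
    """Fetch one URL and extract the kind of on-page features used in this study."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None  # roughly 6% of URLs failed with server response errors

    html = resp.text
    soup = BeautifulSoup(html, "html.parser")

    return {
        "url": url,
        "page_size_html": len(html.encode("utf-8")),                     # Page Size (HTML)
        "word_count": len(soup.get_text(separator=" ").split()),         # Word Count
        "title": soup.title.get_text(strip=True) if soup.title else "",  # Title
        "h1": soup.h1.get_text(strip=True) if soup.h1 else "",           # H1
        "uses_schema_org": "schema.org" in html.lower(),                  # Schema.org Usage
        "outgoing_links": [a["href"] for a in soup.find_all("a", href=True)],  # input for broken-link checks
    }
```

Checking for broken links would then require a second pass that requests each extracted href and records non-200 responses.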

A note on chart types:

We use three types of charts, which may be considered “non-traditional”, to represent the data and the trends across positions. Here are some notes on how to read them and why we think they are helpful.

  1. Multiple probability intervals (“distribution stripes”):
    • The plot shows in a simple way how the data is distributed and allows us to easily compare the distributions within and among positions (see the interval sketch after this list).
    • For each position, several bars of different colors are drawn that each contain X% of the values, starting at the median (somewhere in the 5% area).
    • The dark(er) stripes essentially act as a visual fitting and allow us to determine whether the metric changes with position or differs between large domains and others.
    • In some cases, only one or two bars are visible in the plot - this is because the lower percentages of the data fall within a very narrow range of the metric and are therefore invisible to the eye.
    • Possible adjustments: Of course, we can change the number of levels (currently 6, for data exploration and to give you the chance to choose) and their thresholds. If you, for example, think 6 are too many and no one is interested in where 25% or 75% of the data lie - fine, then we create these plots with 4 levels: 5%, 50%, 95%, 100%.
  2. Point intervals with polynomial or linear fitting:
    • The plot shows, a bit similar to the distribution stripes, where the majority of data sits.
    • The dot represents the median value, the thick line 50% of the data and the thin line 95% of the data.
    • In some cases, only a dot can be seen - this is because more than 95% of the data are equal or close to the median (and the lines therefore lie behind the dot). Due to some outliers, the fitting might look a bit off (but take a closer look at the axis ranges relative to the median and maximum - often the trend is negligible anyway).
    • Possible adjustments: Similar to the chart type above, we can change the interval ranges (now 50% and 95%) to any you like and also reduce or increase their number.
  3. Diverging range plot:
    • Due to the complexity of the data (domain, metric and change in position), we have tried a new plot type that shows the median value of the metric (black dot) and the range from minimum to maximum (segments).
    • The colors indicate whether the values above and below the average, respectively, belong to a lower or higher average position.
    • This plot shows in a simple way which large domains have considerably higher/lower medians and/or ranges, plus whether an increase in that metric is correlated with an increase in average position (all or most segments to the right of the dots are Backlinko-cyan) or with a decrease (all or most segments to the right of the dots are purple).
    • (Detailed methods: For each domain, we calculated the mean of the metric. For each URL, we recorded whether it scored lower or higher than that average. Afterwards, we compared the mean positions of these two groups and colored the segments accordingly; see the grouping sketch after this list. Example: with a metric mean of 5, URLs on low positions mostly score low and those on high positions score high; thus, the mean position of URLs above the metric mean will be greater than the mean position of URLs below it.)
    • Possible adjustments: Again, the length of the segments could be adjusted to represent 50%, 95% or 37.6278% of the data.
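
As a rough illustration of how the nested intervals behind the distribution stripes and the point-interval charts can be derived, the following sketch computes central intervals around the median for each position. The column names and the exact interval levels are assumptions, not the study’s actual code.

```python
# Illustrative sketch: nested central intervals per position (column names assumed).
import pandas as pd

def central_intervals(values, levels=(0.05, 0.50, 0.95, 1.00)):
    """Central intervals around the median covering the given shares of the data."""
    out = {"median": values.median()}
    for level in levels:
        lower = values.quantile(0.5 - level / 2)
        upper = values.quantile(0.5 + level / 2)
        out["%d%%" % round(level * 100)] = (lower, upper)
    return out

# One set of stripes per Google position for a given metric,
# assuming df has "position" and "word_count" columns:
# stripes = df.groupby("position")["word_count"].apply(central_intervals)
```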
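
The grouping behind the diverging range plot (chart type 3) can be hard to follow from the description alone, so here is a small sketch of the logic as we read it: per domain, split URLs at the domain’s mean metric value and compare the mean ranking position of the two groups. Again, the column names are assumptions for illustration.

```python
# Illustrative sketch of the diverging-range-plot grouping (column names assumed).
import pandas as pd

def diverging_summary(df, metric):
    """Per domain: metric median/min/max plus mean position above vs. below the domain mean."""
    rows = []
    for domain, grp in df.groupby("domain"):
        mean_metric = grp[metric].mean()
        above = grp[grp[metric] > mean_metric]
        below = grp[grp[metric] <= mean_metric]
        rows.append({
            "domain": domain,
            "median": grp[metric].median(),
            "min": grp[metric].min(),
            "max": grp[metric].max(),
            # If URLs above the domain mean have a better (lower) mean position,
            # the corresponding segment is drawn in the "improving" color.
            "mean_pos_above": above["position"].mean(),
            "mean_pos_below": below["position"].mean(),
        })
    return pd.DataFrame(rows)
```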

A note on the fittings (visualised in the point-range plots):

  • Compared to simple linear regression, polynomial fittings are a great way to capture more complex patterns in the data. However, they make it more difficult to put hard numbers on the trends (since they are not linearly scaled as “1% more -> 1 position more”).
  • In cases where the polynomial fitting was close to the outcome of a simple linear regression, we used the linear regression instead to reduce complexity and provide simple, linearly scaled lifting numbers.
  • In some cases, the fitting does not have much explanatory power, so we decided not to include models in all cases and/or to state this prominently (referring to a low R^2, for example).
  • Please keep in mind that several of the fittings can be misleading and/or are not or only vaguely supported. Often, the trends are driven by a few URLs that have very extreme values compared to the majority (95% or even more of the data). However, correlation does not imply causation, so the reason is likely not the metric driving the pattern but other factors leading to some URLs with extreme values scoring best (see for example backlinks and referring domains).
  • Possible adjustments:
    • In any case, it is possible to exclude such outliers and calculate the linear fitting/lifting numbers for, let’s say, the top 95% of the data of each position (see the sketch below).
    • Depending on the time left, another option would be generalized linear (mixed) effect models. With this advanced type of regression model, we would likely be able to fit a range of explanatory variables/metrics to see how they affect the response variable “position”. This way, we could directly determine the (relative) effect on the response variable and dig a bit deeper than investigating the effect/trend of each variable on its own. Possible drawbacks here could be (i) the sheer amount of data, which may cause problems when fitting the model; (ii) the correlation between explanatory variables, which leads to the exclusion of some variables (otherwise, effects would be “masked”) - examples here would be backlinks and referring domains, exact and partial anchor matches and likely some more; (iii) potential problems with the prerequisites needed for the model, which could lead to an iteration of model runs and adjustments to find the best data transformation for each variable.
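
The outlier exclusion suggested above can be sketched as follows: drop values above the per-position 95th percentile, then fit both a polynomial and a simple linear trend. The polynomial degree, the trimming threshold and the column names are assumptions for illustration; the study’s actual fitting code is not shown here.

```python
# Illustrative sketch: polynomial vs. linear fitting with an optional 95% trim per position.
import numpy as np
import pandas as pd

def fit_trend(df, metric, degree=3, trim=0.95):
    """Fit metric ~ position; optionally drop values above the per-position `trim` quantile."""
    data = df[["position", metric]].dropna()
    if trim is not None:
        cutoff = data.groupby("position")[metric].transform(lambda s: s.quantile(trim))
        data = data[data[metric] <= cutoff]

    poly = np.poly1d(np.polyfit(data["position"], data[metric], deg=degree))
    linear = np.poly1d(np.polyfit(data["position"], data[metric], deg=1))
    return poly, linear

# poly, linear = fit_trend(df, "refdomains")
# linear.coeffs[0] is the simple, linearly scaled "lifting number" per position.
```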

1.2 Cleaning the Data: What Information Do We Keep for Analysis?

Ahrefs Data

  • Some keywords have fewer than 10 ranking URLs → We removed 52 keywords that contained fewer than 5 positions (see the sketch at the end of this section).

  • Metrics provided:

    • domain rating (Domain_rating) → 11,834,947 values

    • URL rating (URL_rating) → 11,834,932 values

    • number of backlinks (backlinks) → 11,834,947 values

    • number of referring domains (refdomains) → 11,834,947 values

    • exact match (perc_exact_matches) → 11,834,969 values

    • partial match (perc_partial_matches) → 11,834,969 values

    • URL length → 11,834,969 values

  • Some metrics contain NA values:

    • Domain rating: 22 missing values

    • URL rating: 37 missing values

    • Number of backlinks: 22 missing values

    • Number of referring domains: 22 missing values
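
For reference, the keyword-level filter and the missing-value check described above might look like this in pandas. The column names mirror the metric names listed above, but this is a sketch under those assumptions, not the partner’s actual cleaning code.

```python
# Illustrative sketch of the cleaning step (column names follow the metric names above).
import pandas as pd

def clean_rankings(df):
    """Drop keywords with fewer than 5 ranking URLs, then report remaining missing values."""
    cleaned = df.groupby("keyword").filter(lambda grp: len(grp) >= 5)

    metrics = ["Domain_rating", "URL_rating", "backlinks", "refdomains",
               "perc_exact_matches", "perc_partial_matches"]
    print(cleaned[metrics].isna().sum())  # e.g. Domain_rating: 22 missing values
    return cleaned
```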