With Google evaluating sites based on various ranking factors, knowing which ranking factors to focus your SEO strategy on for the biggest impact is crucial.
Several large-scale data studies, mainly conducted by SEO vendors, have sought to uncover the relevance and importance of certain ranking factors*. However, in our view, these studies contain major statistical flaws. For example, using correlation statistics as the main instrument can produce misleading results in the presence of outliers or non-linear associations.
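To illustrate the outlier problem (a toy sketch, not data from any of the studies): a single extreme observation can inflate the Pearson correlation coefficient even when the bulk of the data shows essentially no association.

```python
# Toy illustration: one extreme outlier inflates Pearson's r.
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# 20 points with essentially no association...
x = list(range(20))
y = [5, 3, 4, 6, 5, 4, 6, 5, 3, 4, 6, 5, 4, 5, 6, 4, 5, 3, 6, 5]
r_bulk = pearson(x, y)

# ...plus one extreme outlier (think: a URL with 20M backlinks)
r_outlier = pearson(x + [1000], y + [1000])

print(f"without outlier: r = {r_bulk:.2f}")   # close to 0
print(f"with outlier:    r = {r_outlier:.2f}")  # close to 1
```

This is why a single correlation number, computed over heavily skewed SEO metrics, can suggest a strong "ranking factor" that almost none of the data actually supports.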
Considering these methodological issues and the omission of certain ranking factors, there is a need for rock-solid data distilled into clear takeaways.
Step 1 Ahrefs Raw Data: As a data partner, Ahrefs provided the raw data for the analysis. The data contained 1,183,680 keywords (1,183,628 after data cleaning; for details see below) with a total of 11,835,086 rankings (10,052,136 unique URLs; 10,052,028 after data cleaning).
Step 3 Data Mining: We developed a data-mining script to gather data on various variables. More specifically, we collected data on Schema.org Usage, Word Count, Title, H1, Broken Links and Page Size (HTML). Due to anti-mining mechanisms, authoritative domains such as Amazon.com or youtube.com were not considered (see Section 1.2 for the number of observations excluded for each domain). In the following sections, we refer to these as “Large Domains”. No data could be extracted for roughly 6% of the URLs due to server response errors. In total, we mined data from about 7,633,169 URLs.
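The per-URL extraction step can be sketched as follows. This is an assumed minimal version, not the study's actual script (which was not published here): parse the raw HTML of a page and derive Title, H1 count, Word Count, Schema.org usage and Page Size (HTML), using only the standard library.

```python
# Hypothetical sketch of the data-mining step: derive per-page variables
# (Title, H1, Word Count, Schema.org usage, Page Size) from raw HTML.
from html.parser import HTMLParser

class PageStats(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.h1_count = 0
        self.words = 0
        self.uses_schema = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "h1":
            self.h1_count += 1
        # Schema.org markup typically appears as itemtype="https://schema.org/..."
        if any(v and "schema.org" in v for _, v in attrs):
            self.uses_schema = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.words += len(data.split())

doc = ('<html><head><title>Demo</title></head>'
       '<body itemtype="https://schema.org/Article">'
       '<h1>Hello</h1><p>two words</p></body></html>')
p = PageStats()
p.feed(doc)
print(p.title, p.h1_count, p.words, p.uses_schema, len(doc))
```

Page size here is simply the length of the HTML document; broken-link detection would additionally require fetching each outgoing link, which is omitted in this sketch.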
Step 4 APIs and external data sources: In addition, the Alexa API was used to collect domain-level data on the Time-on-Site and Page Speed variables. Furthermore, Clearscope.io, another data partner, collected “content scores” on 1,000 high-search-volume keywords (see Section 2.3.3 for a detailed explanation).
Step 5 Data Analysis: The data was analysed and processed for selected features to show whether they have a positive or negative association with Google ranking positions. Polynomial regression was applied to all numeric variables. In some cases, linear regression was used instead (e.g. URL length) to provide simple average trends.
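The fitting logic can be sketched as follows. The original analysis was done in R (see the session info at the end); this is a hedged Python/numpy sketch on synthetic data: fit a low-degree polynomial per metric, and fall back to a simple linear fit when the higher-order terms add essentially nothing.

```python
# Sketch of the fitting step (synthetic data, numpy instead of R):
# polynomial fit per metric, with a linear fallback for simple trends.
import numpy as np

positions = np.arange(1, 11)                 # Google positions 1..10
metric = 60.43 + 1.023 * positions           # e.g. average URL length per position

poly = np.polyfit(positions, metric, deg=2)  # quadratic fit
lin = np.polyfit(positions, metric, deg=1)   # linear fit

# If the quadratic coefficient is negligible, prefer the simpler linear model,
# which yields an easy-to-read "1 position -> +1.023 characters" lift.
use_linear = abs(poly[0]) < 1e-6
print(use_linear, lin)   # slope ~1.023, intercept ~60.43
```

The advantage of the linear fallback is interpretability: the slope directly gives a per-position lift number, which a polynomial fit cannot provide.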
A note on chart types:
We are using three types of charts to represent the data and the trends across positions that may be considered “non-traditional” charts. Here are some notes on how to read them and why we think they are helpful.
A note on the fittings (visualised in the point-range plots):

- Compared to simple linear regression, polynomial fittings are a great way to capture more complex patterns in the data. However, they make it more difficult to put hard numbers on the trends (since they are not linearly scaled, as in “1% more → 1 position more”).
- In cases where the polynomial fitting was close to the outcome of a simple linear regression, we used the linear regression instead to reduce complexity and provide simple, linearly scaled lifting numbers.
- In some cases, the fitting does not have much explanatory power, so we decided not to include models in all cases and/or to state this prominently (referring to a low R², for example).
- Please keep in mind that several of the fittings can be misleading and/or are only vaguely supported. Often, the trends are driven by a few URLs with very extreme values compared to the majority (95% or even more of the data). However, correlation does not imply causality, so the reason is likely not the metric driving the pattern but other factors that lead to some URLs with extreme values scoring best (see, for example, backlinks and referring domains).
- Possible adjustments:
  + In any case, it is possible to exclude such outliers and calculate the linear fitting/lifting numbers for, say, the top 95% of the data of each position.
  + Depending on the time left, another option would be generalized linear (mixed) effect models. With this advanced type of regression model, we would likely be able to fit a range of explanatory variables/metrics to see how they affect the response variable “position”. This way, we could directly determine the (relative) effect on the response variable and dig a bit deeper than investigating the effect/trend of each variable on its own.
Possible drawbacks here could be (i) the sheer amount of data, which may cause problems when fitting the model; (ii) correlation between explanatory variables, which leads to the exclusion of some variables (otherwise, effects would be “masked”) - examples would be backlinks and referring domains, exact and partial anchor matches, and likely some more; (iii) potential problems with the prerequisites of the model, which could lead to an iteration of model runs and adjustments to find the best data transformation for each variable.
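The first proposed adjustment (excluding outliers and fitting on the top 95% per position) can be sketched on synthetic data. Everything here is assumed for illustration; the point is that a handful of extreme URLs can no longer drive the trend once values above the 95th percentile are dropped per position.

```python
# Sketch of outlier-trimmed fitting: per position, drop values above the
# 95th percentile, then fit a simple linear trend on the remaining data.
import numpy as np

rng = np.random.default_rng(42)
positions = np.repeat(np.arange(1, 11), 1000)   # 1000 synthetic URLs per position
backlinks = rng.exponential(5, positions.size)  # heavily skewed, no real trend
backlinks[:10] += 2e7                           # a few extreme outliers on position 1

kept_pos, kept_val = [], []
for p in range(1, 11):
    vals = backlinks[positions == p]
    cutoff = np.percentile(vals, 95)            # keep the bottom 95% per position
    mask = vals <= cutoff
    kept_pos.append(np.full(mask.sum(), p))
    kept_val.append(vals[mask])

slope, intercept = np.polyfit(np.concatenate(kept_pos),
                              np.concatenate(kept_val), 1)
print(f"trimmed fit: backlinks = {intercept:.2f} + {slope:.2f} * position")
```

Without the trimming, the 20M-backlink outliers on position 1 would dominate the least-squares fit; after trimming, the estimated slope is close to zero, matching the true (absent) trend in the synthetic data.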
Some keywords have fewer than 10 ranking URLs → We removed 52 keywords that contained fewer than 5 positions.
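The cleaning rule can be expressed compactly (a sketch on a hypothetical mini-dataset, not the actual pipeline code): count the ranking positions per keyword and drop every keyword with fewer than 5.

```python
# Sketch of the keyword-cleaning rule: drop keywords with < 5 positions.
from collections import Counter

# Hypothetical (keyword, position) rows
rows = [("shoes", 1), ("shoes", 2), ("shoes", 3), ("shoes", 4), ("shoes", 5),
        ("rare query", 1), ("rare query", 2)]   # only 2 positions -> dropped

counts = Counter(kw for kw, _ in rows)
cleaned = [(kw, pos) for kw, pos in rows if counts[kw] >= 5]
print(cleaned)   # only the "shoes" rows remain
```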
Metrics provided:
domain rating (Domain_rating) → 11,834,947 values
URL rating (URL_rating) → 11,834,932 values
number of backlinks (backlinks) → 11,834,947 values
number of referring domains (refdomains) → 11,834,947 values
exact match (perc_exact_matches) → 11,834,969 values
partial match (perc_partial_matches) → 11,834,969 values
URL length (perc_partial_matches) → 11,834,969 values
Some metrics contain NA values:
Domain rating: 22 missing values
URL rating: 37 missing values
Number of backlinks: 22 missing values
Number of referring domains: 22 missing values
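The NA counts above can be reproduced with a simple per-metric tally; a sketch on a hypothetical mini-table (the metric names match the dataset, the values are made up, `None` marks a missing value):

```python
# Sketch of the missing-value check: count NAs per metric column.
metrics = {
    "Domain_rating": [71.0, None, 80.5, 64.0],
    "URL_rating":    [12.0, 11.0, None, None],
    "backlinks":     [0, 3, 0, 120],
}
na_counts = {name: sum(v is None for v in col) for name, col in metrics.items()}
print(na_counts)
```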
We also have a lot of large domains that we did not scrape, and we compare the trends for both large domains and all other URLs. The following domains were classified as large domains:
Domain | Count |
---|---|
en.wikipedia.org | 314,512 |
youtube.com | 299,564 |
amazon.com | 295,468 |
facebook.com | 230,974 |
pinterest.com | 144,683 |
yelp.com | 140,694 |
tripadvisor.com | 82,815 |
ebay.com | 76,819 |
reddit.com | 70,614 |
linkedin.com | 69,778 |
twitter.com | 66,438 |
walmart.com | 65,094 |
imdb.com | 63,496 |
yellowpages.com | 47,135 |
mapquest.com | 43,779 |
quora.com | 41,583 |
etsy.com | 40,675 |
target.com | 29,727 |
instagram.com | 29,634 |
9,681,487 URLs were classified as other domains.
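The classification itself reduces to a hostname lookup against the table above; a sketch (standard-library URL parsing, with only a subset of the domain list spelled out):

```python
# Sketch of the large-domain classification based on the table above.
from urllib.parse import urlparse

LARGE_DOMAINS = {"en.wikipedia.org", "youtube.com", "amazon.com", "facebook.com"}
# ... remaining domains from the table above omitted for brevity

def is_large_domain(url):
    """Classify a URL as belonging to a large domain."""
    host = urlparse(url).netloc.lower()
    return host.removeprefix("www.") in LARGE_DOMAINS

print(is_large_domain("https://www.youtube.com/watch?v=abc"))   # True
print(is_large_domain("https://example.com/page"))              # False
```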
In this section, we analyse how different ranking factors relate to higher organic positions in the Search Engine Results Pages (SERPs).
More specifically, we look at the following factors:
(Note: Logarithmic scale (log10) on the x axis.)
(Note: Logarithmic scale on both the x and y axis.)
Key takeaways:
The majority of URLs (more than 95%) contain no backlinks at all.
This pattern is independent of position (see additional plots below).
Due to the highly skewed data, any trend found has to be treated with caution - a few URLs drive the pattern.
Key takeaways:
More than 95% of all URLs do not contain any backlinks (only light dots and green bars, respectively; the lines/all other bars sit exactly at 0 and are thus not visible). This pattern is independent of position.
Also, the maximum range appears to be independent of position, since some URLs containing more than 20M backlinks can be found on ranks 2, 3, 5, 6 and 9.
Given URLs that contain millions of backlinks, the trend does not seem to be meaningful.
The fitting also accounts for a few URLs with very high values (not contained in the major 95%), so it looks a bit off.
In a next step, we compare domains classified as large with all others.
Key takeaways:
Large domains mostly contain a low number of backlinks, with a distinct peak above 200K on #2 (an en.wikipedia.org URL).
Other domains also contain no backlinks in almost all cases, but the ranges are far higher, sometimes exceeding 20M.
To investigate the patterns in more detail, we split the group of large domains into each domain on its own. This way we can see that Wikipedia contains many more backlinks than any other large domain and drives the pattern seen in the previous plot.
Key takeaways:
Wikipedia contains remarkably more backlinks on average (47.2) than any other large domain (0.026).
In almost all cases, more backlinks result in a higher average position (most bars pointing to the right are coloured backlinko-cyan).
Unfortunately, most URLs do not contain any backlinks at all. In a follow-up step, we had a look at all URLs that contained at least one backlink.
Key takeaways:
Key takeaways:
When URLs without any backlinks were excluded:
Top-ranked URLs contain more backlinks than lower-ranked URLs.
URLs ranked #1 and #2 contain approx. 3.8 and 2 times more backlinks, respectively, than lower-ranked ones.
Large domains contain considerably more backlinks than URLs of other domains (median of 170 for large domains versus 5 for others).
(Note: Logarithmic scale on the x axis.)
(Note: Logarithmic scale on both the x and the y axis.)
Key takeaways:
Almost all URLs do not contain any referring domains.
This pattern is independent of position (see additional plots below).
Due to the highly skewed data, any trend found has to be treated with caution - a few URLs with millions of backlinks drive the pattern.
Key takeaways:
The number of referring domains shows the same pattern as backlinks, with more than 95% of URLs containing no referring domains at all (only dots in the point interval, only light bars in the distribution stripes).
Again, the maximum range does not seem to correlate with position (if anything, more referring domains are found for URLs on higher positions, but we will look at this in more detail later).
The trend may look obvious, but it is hardly a trend at all - there is a difference of approx. 0.5 referring domains between #1 and lower positions!
Key takeaways:
On average, large domains contain 0.968 referring domains while all other URLs contain 0.079.
Large domains contain 10,155 referring domains at a maximum, while all other URLs reach a maximum of 7,414. There seems to be no pattern of maximum range with position.
Unfortunately, most URLs do not contain any referring domains at all. In a follow-up step, we had a look at all URLs that contained at least one referring domain.
Key takeaways:
When URLs without any referring domains were excluded:
Top-ranked URLs contain more referring domains than those ranked lower.
URLs ranked #1 and #2 contain approx. 3.2 and 2 times more referring domains, respectively, than lower-ranked ones.
Large domains contain considerably more referring domains than URLs of other domains (median of 62 versus 3).
Key takeaways:
Key takeaways:
Average and median domain rating increase with better position.
#2 has the highest average and median rating (74.3 and 84).
The median is above 80 for #1-#4, exactly 80 for #5 and #6, and below 80 for all lower-ranked URLs (maximum median of 84 for #2, minimum median of 76 for #10).
URLs of all positions cover the whole range from 0 to 100.
Key takeaways:
Large domains have remarkably higher average and median domain ratings (mean of 95.4 and median of 95) compared to all other domains (65.2 and 75).
The range of ratings is very narrow for large domains, while it is much wider for all others.
Again, the whole range of possible ratings is covered in all cases.
Key takeaways:
Large domains have quite similar average ratings, ranging between 89 and 100.
facebook.com has an average domain rating of 100, closely followed by the social media platforms (twitter.com: 99, linkedin.com: 98, youtube.com: 98, instagram.com: 98, pinterest.com: 97) as well as en.wikipedia.org (95) and amazon.com (95).
The lowest scores of all large domains belong to yellowpages.com (89), target.com (90) and walmart.com (90).
In general, lower ratings correlate with lower positions for most large domains, with the exception of ebay.com and walmart.com.
(Note: Logarithmic scale on the x axis.)
Key takeaways:
On a logarithmic scale, Alexa’s daily time-on-site measure is normally distributed with a mean of 197.7 seconds.
The range covers below 10 to more than 10,000 seconds (~167 minutes).
(Note: We excluded the 100% range here to make the pattern more visible. A plot containing 100% of the data can be found after the key takeaways.)
Key takeaways:
Median page speed is 1.65 seconds. This pattern is independent of position.
The range of speeds also does not differ among positions.
Most reported page speeds are below 5,000 milliseconds (5 seconds), but a few URLs are remarkably slower, with speeds of up to 7M milliseconds (= 1.9 hours).
Note: Again, the trend is driven by a few outliers with very high page-speed values - I would not conclude here that better-ranked URLs are slower. More likely, a few heavy and slow pages that often rank in the top 3 (for reasons other than page speed) skew the trend. This also becomes obvious when looking at the trend of the median (dots), which seems to increase (slightly) with better positioning. If you prefer, we can run a similar analysis excluding the 5% of outliers with the slowest speeds.
Additional plot including 100% range:
(Note: Logarithmic scale on the x axis.)
(Note: Logarithmic scale on the x axis. A plot with a simple linear axis can be found below after the key takeaways.)
Key takeaways:
Key takeaways:
!!! Wrong plot. The plot shown here is the one for URL rating.
Key takeaways:
There is no correlation of page size with position.
Page size does not differ much among positions.
Most URLs are quite small (around 100 units), and a few are very large, up to and exceeding 60,000 (light-green bars in the second plot).
The median page size is, largely independent of position, around 94 (93.7 overall; range from 90.5 on #10 to 96.4 on #3).
Key takeaways:
Higher content scores correlate with better positions.
On both devices, desktop and mobile, an increase of 1 in content score improves position by 1.
To focus a bit more on the main pattern, we keep only the major 50% of URLs for each position.
Key takeaways:
Key takeaways:
Note: In this case, the segments all lie behind the dot, meaning that more than 95% of the values are (at least, and actually exactly) zero. The fitting also accounts for a few URLs with very high values (not contained in the major 95%), so it looks a bit off. However, we discourage you from concluding anything from this plot, since the change is too small to explain anything at all.
Key takeaways:
Key takeaways:
Key takeaways:
Median URL rating is almost identical across positions (range: 11-12).
URLs on positions 1 to 6 have a median of 12; lower-ranked URLs have a median of 11.
URL rating is 11.2 on average.
Key takeaways:
Large domains also score higher than other domains (median of 15 versus 11; mean of 14.8 versus 10.4).
Ranges of URL rating are in general very narrow for most URLs, no matter whether they belong to a large or another domain.
Key takeaways:
In general, there is not much variation in average URL rating across positions.
Among the top large domains with regard to URL rating are social media platforms (facebook.com, twitter.com, instagram.com, youtube.com, linkedin.com), en.wikipedia.org and amazon.com.
instagram.com and en.wikipedia.org have by far the highest URL ratings, with values above 70.
Key takeaways:
Key takeaways:
There is a wide range of URL lengths, some with more than 2,000 characters (maximum of 2,075 for #8).
Since the pattern is quite linear, we used a simple linear regression here: length of URL = 60.43 + 1.023 * position (but with a very low R² of 0.01).
Average URL length increases with lower ranking → URLs on #10 are on average 9.2 characters longer (70.7) than those on #1 (61.5).
In general, the ranges of URL length are quite similar among positions, but the maximum length also increases slightly with lower ranking. Especially #1 and #2 have low maxima compared to the other 8 positions.
The majority of the data (> 95%) has relatively short URLs, with an overall average of ~66 characters.
To see the trends in more detail, we can either use a log scale or use the same plot with the main 75% per position only.
The range of URL lengths is almost the same among positions, with a slight trend towards shorter URLs on top positions.
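Plugging positions into the reported linear model shows where the 61.5 and 70.7 character averages come from; keep in mind the very low R² means this describes a weak average trend, not a strong predictor.

```python
# Worked example of the reported linear model for URL length.
def expected_url_length(position):
    """Average URL length predicted by: length = 60.43 + 1.023 * position."""
    return 60.43 + 1.023 * position

for pos in (1, 10):
    print(pos, round(expected_url_length(pos), 1))
# position 1 -> ~61.5 characters, position 10 -> ~70.7 characters
```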
Key takeaways:
Additional plots
Logarithmic scale:
Another way to look a bit more closely at differences in URL length.
Key takeaways:
URLs of large domains are on average slightly shorter than other URLs (60.8 characters versus 67.2 characters).
The increase in average length is not as pronounced as for other domains, with an increase of only 4.8 characters when comparing #1 and #10 (versus 10.9 characters for all other domains).
Key takeaways:
tripadvisor.com has the longest URLs on average (108.9 characters), followed by reddit.com (85.8 characters), walmart.com (83.8 characters) and quora.com (78.0 characters).
Some URLs hosted by facebook.com (1,327 characters) and ebay.com (1,255 characters) consist of the most characters.
There is no clear pattern of URL length versus position; longer URLs more often correlate with lower positions, but not always.
Key takeaways:
Key takeaways:
Most URLs (> 95%) contain between 100 and 10,000 words in their body, with a median of 931 words.
The linear fitting predicts a tiny decrease in body word count of 2.47 words per position (linear model: word amount = 1461.1 - 2.47 * position).
Note: The support for the linear model is extremely low (R² < 0.00001)!
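The same plug-in check for the body word-count model makes the "tiny" verdict concrete: even across the full #1-#10 spread, the predicted difference is only about 22 words, consistent with the near-zero R².

```python
# Worked example of the reported linear model for body word count.
def expected_word_count(position):
    """Word count predicted by: word amount = 1461.1 - 2.47 * position."""
    return 1461.1 - 2.47 * position

diff = expected_word_count(1) - expected_word_count(10)
print(f"predicted #1 vs #10 difference: {diff:.2f} words")
```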
Key takeaways:
Most URLs (72.6%) do not use schema markup.
If there is any trend, it is that #1 and #2 slightly less often have URLs with schema markup (25.1% and 26.3%; all other positions between 27.2% and 28.1%).
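The per-position percentages above boil down to a share of boolean flags; a toy sketch with made-up data (three positions, four URLs each):

```python
# Sketch: share of URLs per position that carry schema.org markup.
data = {1: [True, False, False, False],    # hypothetical per-URL flags
        2: [True, False, False, False],
        3: [True, True, False, False]}
share = {pos: sum(flags) / len(flags) for pos, flags in data.items()}
print(share)
```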
Session Info
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.1.0.9000 forcats_0.4.0 stringr_1.4.0
## [4] dplyr_0.8.3 purrr_0.3.3 readr_1.3.1
## [7] tidyr_1.0.0 tibble_2.1.3 ggplot2_3.2.1
## [10] tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.12 haven_2.2.0 lattice_0.20-38
## [5] colorspace_1.4-1 vctrs_0.2.1 generics_0.0.2 viridisLite_0.3.0
## [9] htmltools_0.4.0 yaml_2.2.0 rlang_0.4.2 pillar_1.4.3
## [13] withr_2.1.2 glue_1.3.1 DBI_1.1.0 dbplyr_1.4.2
## [17] modelr_0.1.5 readxl_1.3.1 lifecycle_0.1.0 munsell_0.5.0
## [21] gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5 evaluate_0.14
## [25] knitr_1.27 fansi_0.4.1 highr_0.8 broom_0.5.3
## [29] Rcpp_1.0.3 scales_1.1.0 backports_1.1.5 webshot_0.5.2
## [33] jsonlite_1.6 fs_1.3.1 hms_0.5.3 digest_0.6.23
## [37] stringi_1.4.5 rprojroot_1.3-2 grid_3.6.2 here_0.1
## [41] cli_2.0.1 tools_3.6.2 magrittr_1.5 lazyeval_0.2.2
## [45] crayon_1.3.4 pkgconfig_2.0.3 zeallot_0.1.0 xml2_1.2.2
## [49] reprex_0.3.0 lubridate_1.7.4 assertthat_0.2.1 rmarkdown_2.0
## [53] httr_1.4.1 rstudioapi_0.10 R6_2.4.1 nlme_3.1-142
## [57] compiler_3.6.2