How To Fix Historical Redirects With The Wayback Machine APIs


What would you answer if a well-known company asked you to give them a single tip to improve the way they do SEO? What if you had to decide which action should be taken to get the most results for their SEO campaign without a large investment of time and resources?

If someone asked me this question, I would say that fixing historical redirects should be a priority. There are always old links that redirect to the wrong pages, redirects that stopped working, or redirects that were simply dropped over time. These redirects still exist, and they are sending users to the wrong pages or showing error messages. This is something a lot of websites aren’t tapping into, and they are missing out on something valuable.

I have been doing SEO for years, and I have found that fixing old redirects is probably one of the most efficient things that can be done. Redirects are not fun to work with, and a lot of companies simply forget about them over the years. Most companies do not have a process to keep track of their redirects and to make sure they are updated when needed.

I love what I do. I love being able to show massive improvements to my clients and to resolve issues they have been dealing with for a while. Fixing redirects allows me to do that with just a few hours of my time. This is an excellent way to make a positive impression on a client and to show them how much I can do to help them.

Finding Lost Links

For one client, I used the Internet Archive’s Wayback Machine to find pages from previous versions of the site and fixed the redirects pointing to them. I was able to find historical redirects this company hadn’t tracked, and I helped the company double its traffic within a month.

Fixing historical redirects makes a huge difference and I think this is something that can help a company perform better than a competitor in a very short amount of time.

I recovered a number of different referring domains. If I normally saw this kind of traffic in Ahrefs, I would assume it was spam. However, this is the difference that recovering historical redirects made. All I had to do was recover old pages from previous versions of the site.

Fixing a few redirects is a matter of minutes, while developing new links via a link-building campaign is a far more time-consuming process.

Google Analytics data and SEMrush showed that traffic went up by a little more than 30% within a month.

Wayback Machine CDX Server Or Screaming Frog?

In my previous post, I used Screaming Frog to crawl the Wayback Machine. This is the most comprehensive way to retrieve URIs, since it lets you crawl every page present in the archive, the output is clean, and you get a complete picture. However, it isn’t an efficient way to work: the crawl limitations make this impractical for larger sites.

I talked about this at SMX Advanced, and maybe fifty people came up to ask questions after my presentation. They all wanted to know more about a bonus tip I had shared: pulling URIs from the Wayback Machine to easily find and fix historical redirects. I was a little disappointed that the rest of the presentation didn’t generate the same response.

Using The Wayback Machine CDX Server


You can visit the Wayback Machine’s GitHub repository and find the full documentation for its CDX Server API.

You can enter a basic query for a website:

web.archive.org/cdx/search/cdx?url=yourwebsite.com

However, it is best to use something more detailed like this:

web.archive.org/cdx/search/cdx?url=yourwebsite.com&matchType=domain&fl=original&collapse=urlkey&limit=5000

Here is how I came up with this more detailed query (a Python sketch that pulls the same results follows the list):

&matchType=domain means I want results for the domain and all of its subdomains.
&fl=original asks only for the original URIs, without any other fields I don’t need.
&collapse=urlkey asks for unique listings so there won’t be any duplicates in the results.
&limit=5000 limits the results shown so there aren’t more than 5,000 rows.
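If you prefer to script the request rather than paste the query into a browser, here is a minimal Python sketch using the requests library. The parameters mirror the query string above, and yourwebsite.com is a placeholder for the domain you are auditing.

    import requests

    # Query the Wayback Machine CDX Server with the same parameters
    # as the detailed query above.
    CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

    params = {
        "url": "yourwebsite.com",  # placeholder: the domain you are auditing
        "matchType": "domain",     # the domain plus all of its subdomains
        "fl": "original",          # return only the original URI field
        "collapse": "urlkey",      # collapse duplicates on the canonical URL key
        "limit": "5000",           # cap the output at 5,000 rows
    }

    response = requests.get(CDX_ENDPOINT, params=params, timeout=60)
    response.raise_for_status()

    # The response is plain text with one URI per line.
    uris = [line for line in response.text.splitlines() if line]
    print(f"Retrieved {len(uris)} archived URIs")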

I have also been using the Resumption Key a lot. This feature allows you to create large queries and to continue where you left off. I sometimes use the Pagination API for larger queries so I can easily break data into chunks. You can find out more about these features by looking at the documentation.
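Here is a rough sketch of how that resumption-key flow can be scripted, assuming the showResumeKey and resumeKey parameters behave as the CDX Server documentation describes, with the key arriving after a blank line at the end of each batch:

    import requests

    CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

    def fetch_all_uris(domain, batch_size=5000):
        """Yield every archived URI for a domain, batch by batch."""
        params = {
            "url": domain,
            "matchType": "domain",
            "fl": "original",
            "collapse": "urlkey",
            "limit": str(batch_size),
            "showResumeKey": "true",  # ask the server to append a resume key
        }
        while True:
            text = requests.get(CDX_ENDPOINT, params=params, timeout=60).text
            lines = text.splitlines()
            if "" in lines:
                # More batches remain: a blank line separates the URIs
                # from the resume key on the following line.
                cut = lines.index("")
                yield from lines[:cut]
                params["resumeKey"] = lines[cut + 1]  # continue from here
            else:
                # Last batch: no resume key was appended.
                yield from (line for line in lines if line)
                break

    # Usage: collect everything into a list for later cleanup.
    all_uris = list(fetch_all_uris("yourwebsite.com"))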

The Regex filtering has been useful too. I can use this feature to get rid of files I don’t need, for instance js, css, or ico files. I can also use it to filter out all the image formats if I don’t need to see that type of content.
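For instance, the documented filter parameter takes a [!]field:regex expression, so a negated filter on the original field drops asset rows server-side. Extending the params dictionary from the earlier sketch (the extension list is just an example to adjust):

    # Add a negated regex filter so rows whose original URI ends in
    # an asset extension never reach the output.
    params["filter"] = r"!original:.*\.(js|css|ico|jpe?g|png|gif|svg)(\?.*)?$"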

How To Clean Up The Output

There is more work to do. A query like the one listed above gives you a reasonably clean output, but you will still see a lot of elements you don’t need: feeds, images, and robots.txt files, for example. You can filter all of these out. You can also clean up the output by getting rid of campaign tags, parameters, ports, or characters that haven’t been converted from UTF-8.

You can use the different CDX Server filters to clean up the output, or simply use a spreadsheet to format the output according to what you need to see. This might sound complicated, but it isn’t. There are countless ways to clean up and format the output, and you will need to find a method that works for what you need to do. Test a combination of filters, or just clean up the output manually, until you get a clean list of all the old pages.
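As one example of the scripted route, here is a hedged Python sketch that decodes percent-escaped characters, drops ports, strips campaign tags and other parameters, filters out the asset and feed paths mentioned above, and removes duplicates. The exclusion list is an assumption you would tune for your own site.

    from urllib.parse import urlsplit, urlunsplit, unquote

    # Paths and extensions to drop from the CDX output; adjust to taste.
    EXCLUDED_ENDINGS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".css",
                        ".js", ".ico", "/robots.txt", "/feed")

    def clean_uris(uris):
        """Return a de-duplicated list of cleaned page URLs."""
        seen, cleaned = set(), []
        for uri in uris:
            parts = urlsplit(unquote(uri))   # decode %-escaped characters
            host = parts.hostname or ""      # hostname drops any :80/:8080 port
            path = parts.path or "/"
            if path.lower().endswith(EXCLUDED_ENDINGS):
                continue                     # skip assets, feeds, robots.txt
            # Rebuild without query strings, campaign tags, or fragments.
            url = urlunsplit((parts.scheme or "http", host, path, "", ""))
            if url not in seen:              # drop duplicates after cleanup
                seen.add(url)
                cleaned.append(url)
        return cleaned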

How To Check The Redirects

Now that you have a list of URIs, you need to check these old pages. I use Screaming Frog to do this, and you can check the guide created by Dan Sharp to find out more. If you find 404 errors, fix them and you will recover lost signals. Clean up the 301s and 302s as well, though the best option depends on what your goals are for these redirects.
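If you would rather script this check than run the list through a crawler, here is a minimal sketch that requests each URI without auto-following redirects and walks the chain by hand, so both 404s and multi-hop chains show up. The example URL is a placeholder.

    import requests
    from urllib.parse import urljoin

    def trace_redirects(uri, max_hops=10):
        """Return (final_status, hops) for a URI, following redirects manually."""
        hops = []
        current = uri
        for _ in range(max_hops):
            resp = requests.head(current, allow_redirects=False, timeout=30)
            if resp.status_code in (301, 302, 307, 308):
                hops.append((resp.status_code, current))
                location = resp.headers.get("Location")
                if not location:
                    break  # a redirect with no target is itself broken
                current = urljoin(current, location)  # resolve relative targets
            else:
                return resp.status_code, hops
        return None, hops  # chain never resolved within max_hops

    status, hops = trace_redirects("http://yourwebsite.com/old-page")
    if status == 404:
        print("404 - fix this one to recover lost signals")
    elif len(hops) > 1:
        print(f"{len(hops)}-hop chain - collapse it into a single 301")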

Google often says that 302s and 301s are the same thing; my experience has made me believe the contrary. John Mueller from Google said that 301s and 302s were the same, which prompted me to run a few tests between December and March. The tests showed some dropoffs with the 302s that had multiple hops.

Gary Illyes also weighed in on Twitter, and I did more tests. I can’t really draw a conclusion yet, since it has only been a month and I need to do a more thorough analysis. I think 301s and 302s might be treated the same now, but this might only apply to Google. Either way, it is best to clean up the redirect chains and to use only 301s, since this is the best practice.

To Conclude On Historical Redirects

There are different ways to find and fix historical pages and their redirects. The only thing that matters is that you do it. It has been my experience that this is the most efficient way to boost traffic and improve SEO. You should also implement some best practices to prevent old redirects from being forgotten again and to preserve the value of the links you recovered.

It is difficult to determine the impact this will have, since it really depends on the website you are working with. There won’t be any old redirects to recover for a recent website, but fixing historical redirects can make a huge difference for a website that has been around for years and has been redesigned several times. The results depend on how many redirects were lost.

Update your disavow file and monitor new incoming links if you are re-activating old redirects that you know were targeted by spam in the past!
