How to Find All Existing and Archived URLs on a Website

There are numerous reasons you might need to find all the URLs on a website, but your exact purpose will determine what you're looking for. For example, you might want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list and then deduplicate the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
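
If you do find an old sitemap, pulling the URLs out of it takes only a few lines. Here's a minimal Python sketch, assuming a standard sitemap.xml saved locally (the file names are placeholders):

import xml.etree.ElementTree as ET

# Standard sitemap namespace from the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("old-sitemap.xml")  # placeholder: your saved sitemap file
urls = [(loc.text or "").strip() for loc in tree.getroot().findall(".//sm:loc", NS)]

with open("sitemap-urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Recovered {len(urls)} URLs")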

Archive.org
Archive.org is a valuable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which can be insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
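
Another way around the export and 10,000-URL limits is to query the Wayback Machine's CDX API directly. Here's a minimal Python sketch using its documented parameters (the domain is a placeholder, and very large sites may also need the API's paging options):

import requests

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",  # prefix match: every captured path on the domain
        "output": "json",
        "fl": "original",        # return only the original URL field
        "collapse": "urlkey",    # deduplicate repeated snapshots of the same URL
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # first row is the header

with open("archive-urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Found {len(urls)} archived URLs")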

Moz Pro
While you'd typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs on your site. If you're dealing with a massive site, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
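
For bigger properties, the API's Search Analytics endpoint can page well past the UI export cap. Here's a minimal sketch with the google-api-python-client library, assuming a service account that has been granted read access to the property (the credentials file, dates, and property URL are placeholders):

from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder credentials file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

urls, start_row = set(), 0
while True:
    response = service.searchanalytics().query(
        siteUrl="https://example.com/",  # placeholder property
        body={
            "startDate": "2024-01-01",  # placeholder date range
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,      # API maximum per request
            "startRow": start_row,  # paginate past the first batch
        },
    ).execute()
    rows = response.get("rows", [])
    if not rows:
        break
    urls.update(row["keys"][0] for row in rows)
    start_row += len(rows)

with open("gsc-urls.txt", "w") as f:
    f.write("\n".join(sorted(urls)))

print(f"{len(urls)} pages with impressions")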

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively getting around the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."

Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
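
If you'd rather script the same filtered pull, the GA4 Data API can replicate the report. Here's a minimal sketch with the google-analytics-data Python client, assuming application-default credentials with access to the property (the property ID and the /blog/ pattern are placeholders):

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # uses GOOGLE_APPLICATION_CREDENTIALS

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    # Narrow to blog URLs, like the segment defined in Step 3 above
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)
response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]

with open("ga4-urls.txt", "w") as f:
    f.write("\n".join(paths))

print(f"{len(paths)} page paths")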

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be enormous, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process, or a short script will do, as sketched below.
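
If you don't have a log analyzer handy, even a short script can pull the unique paths out. Here's a minimal sketch, assuming logs in the common combined format (the file name and host name are placeholders; CDN log formats will differ):

import re

# In combined log format the request line is quoted, e.g. "GET /blog/post-1/ HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[^"]+"')

paths = set()
with open("access.log") as f:  # placeholder log file
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1).split("?")[0])  # drop query strings

with open("log-urls.txt", "w") as out:
    out.write("\n".join(f"https://example.com{p}" for p in sorted(paths)))

print(f"{len(paths)} unique paths")
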
Combine, and good luck
Once you've collected URLs from all of these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
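
In a notebook, pandas handles the merge and deduplication in a few lines. Here's a minimal sketch, assuming each source was saved as a one-column text file as in the earlier snippets (the file names and normalization rules are placeholders; adjust them to how your site treats trailing slashes and case, and note that GA4 exports paths rather than full URLs, so prefix your hostname first):

import pandas as pd

sources = ["sitemap-urls.txt", "archive-urls.txt", "gsc-urls.txt",
           "ga4-urls.txt", "log-urls.txt"]  # placeholder per-source exports

frames = [pd.read_csv(path, header=None, names=["url"]) for path in sources]
urls = pd.concat(frames, ignore_index=True)

# Normalize: trim whitespace, drop fragments and trailing slashes
urls["url"] = (
    urls["url"]
    .str.strip()
    .str.replace(r"#.*$", "", regex=True)
    .str.rstrip("/")
)

urls = urls.drop_duplicates().sort_values("url")
urls.to_csv("all-urls.csv", index=False)
print(f"{len(urls)} unique URLs")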

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
