There are several reasons you might need to find all the URLs on a website, but your exact goal will determine what you’re searching for. For instance, you might want to:
Find every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Gather all 404 URLs to recover from post-migration errors
In each case, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and hard to extract data from.
In this post, I’ll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site’s size.
Old sitemaps and crawl exports
If you’re looking for URLs that recently disappeared from the live site, there’s a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven’t already, check for these files; they can often provide what you need. But if you’re reading this, you probably didn’t get so lucky.
Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.
However, there are some limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn’t a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these constraints mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn’t indicate whether Google indexed a URL, but if Archive.org found it, there’s a good chance Google did, too.
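If you’re comfortable with a little code, the same data is also available programmatically through the Wayback Machine’s CDX API, a route the walkthrough above doesn’t cover. Here’s a minimal Python sketch; the domain and the limit value are placeholders to swap for your own:

```python
import requests

# Query the Wayback Machine's CDX API for captured URLs on a domain.
# "example.com" is a placeholder; replace it with your own domain.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",
        "matchType": "domain",       # include subdomains
        "output": "json",
        "fl": "original",            # return only the original URL field
        "collapse": "urlkey",        # deduplicate repeat captures of the same URL
        "filter": "statuscode:200",  # optional: successful captures only
        "limit": "50000",
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # the first row is the field-name header
print(f"{len(urls)} URLs retrieved")
```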
Moz Pro
While you’d typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re dealing with a large website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.
It’s important to note that Moz Pro doesn’t confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.
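Whichever route you take, you’ll usually end up with a CSV of inbound links. Here’s a minimal pandas sketch for pulling the unique target URLs out of such an export; the filename and the “Target URL” column name are assumptions, so check them against your actual export’s header:

```python
import pandas as pd

# Load a Moz Pro inbound-links export. The filename and the "Target URL"
# column name are assumptions; adjust them to match your export.
links = pd.read_csv("moz_inbound_links.csv")
target_urls = links["Target URL"].dropna().drop_duplicates().sort_values()
target_urls.to_csv("moz_target_urls.csv", index=False)
print(f"{len(target_urls)} unique target URLs")
```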
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t carry over to the export, you might have to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
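For reference, here’s a minimal sketch of pulling pages from the Search Console API with Python, paginating past the UI export cap. It assumes a service account with access to the property; the credentials path, site URL, and date range are placeholders:

```python
# pip install google-api-python-client google-auth
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder path to your credentials file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    resp = service.searchanalytics().query(
        siteUrl="sc-domain:example.com",  # placeholder property
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,  # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = resp.get("rows", [])
    pages.update(row["keys"][0] for row in rows)
    if len(rows) < 25000:  # last page of results
        break
    start_row += 25000

print(f"{len(pages)} pages with impressions")
```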
Indexing → Pages report:
This section offers exports filtered by issue type, though these too are limited in scope.
Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create distinct URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report
Step 2: Click “Create a new segment.”
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they provide valuable insights.
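If you’d rather script this than click through the UI, the GA4 Data API exposes the same report. Below is a minimal sketch using the official Python client; the property ID and the /blog/ pattern are placeholders, and it assumes credentials are configured via GOOGLE_APPLICATION_CREDENTIALS:

```python
# pip install google-analytics-data
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # reads GOOGLE_APPLICATION_CREDENTIALS

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    # Narrow the report to blog URLs, mirroring the segment in the steps above
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} blog paths")
```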
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be massive, and many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process.
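If you want to skip dedicated tooling, extracting paths yourself is straightforward. Here’s a minimal sketch assuming logs in the common combined format; the filename is a placeholder:

```python
import re
from urllib.parse import urlsplit

# Match the request line in combined/common log format: "GET /path HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[^"]*"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as log:  # placeholder filename
    for line in log:
        match = REQUEST_RE.search(line)
        if match:
            # Keep only the path component; some requests carry full URLs
            paths.add(urlsplit(match.group(1)).path)

print(f"{len(paths)} unique paths requested")
```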
Merge, and good luck
Once you’ve gathered URLs from these sources, it’s time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
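For the Jupyter route, here’s a minimal pandas sketch. The filenames, the single “url” column, the placeholder host for path-only sources, and the normalization choices (lowercasing the host, stripping fragments and trailing slashes) are all assumptions to adapt to your data:

```python
import pandas as pd
from urllib.parse import urlsplit

CANONICAL_HOST = "https://example.com"  # placeholder host for path-only sources

def normalize(url: str) -> str:
    """Lowercase scheme and host, keep the query, strip a trailing slash."""
    url = url.strip()
    if url.startswith("/"):  # GA4 and log exports give bare paths
        url = CANONICAL_HOST + url
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"

# Placeholder filenames: one CSV per source, each with a "url" column
sources = ["archive_org.csv", "moz_target_urls.csv", "gsc_pages.csv",
           "ga4_paths.csv", "log_paths.csv"]
urls = pd.concat([pd.read_csv(f)["url"] for f in sources], ignore_index=True)

deduped = urls.dropna().map(normalize).drop_duplicates().sort_values()
deduped.to_csv("all_urls_deduped.csv", index=False)
print(f"{len(deduped)} unique URLs")
```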
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!