Identify all indexable pages.


Identify the best performing pages.
How to crawl a legacy site.
Crawl the old website so that you have a copy of all URLs, page titles, metadata, headers, redirects, broken links, and so on. Regardless of your preferred crawler (see Appendix), make sure the crawl is not too restrictive. Before crawling a legacy site, pay close attention to the crawler's settings and consider whether you need to:

Ignore robots.txt (in case a critical part of the site is accidentally blocked)
Follow internal "nofollow" links (so that the crawler reaches more pages)
Crawl all subdomains (depending on scope)
Crawl outside the start folder (depending on scope)
Change the user agent to Googlebot (Desktop)
Change the user agent to Googlebot (Smartphone)
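The first setting above matters because a misconfigured robots.txt can silently hide a critical section from a default crawl. As a minimal sketch of that check (the robots.txt content and URLs here are hypothetical), Python's standard library can tell you which URLs a given user agent would be blocked from:

```python
from urllib import robotparser

# Hypothetical robots.txt content from the legacy site (assumed for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /checkout/
"""

def is_blocked(url, user_agent="Googlebot"):
    """Return True if this robots.txt disallows the URL for the user agent."""
    parser = robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return not parser.can_fetch(user_agent, url)

# If /private/ were actually a critical section, a crawler that respects
# robots.txt would never reach it - hence the "ignore robots.txt" option.
print(is_blocked("https://example.com/private/account"))  # True: skipped by default
print(is_blocked("https://example.com/products/shoes"))   # False: crawlable
```

Running a check like this before the full crawl is a quick way to decide whether the "ignore robots.txt" setting is needed for your scope.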
Pro tip: Keep a copy of the old site's crawl data (on file or in the cloud) for several months after the migration is complete, just in case you need any of the old site's data after the new site goes live.
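Archiving the crawl can be as simple as exporting the key fields to a CSV file you keep alongside the migration plan. A minimal sketch (the field names and example rows are assumptions, not a specific crawler's export format):

```python
import csv

# Hypothetical crawl export: one dict per crawled URL.
crawl_rows = [
    {"url": "https://old.example.com/", "status": 200, "title": "Home"},
    {"url": "https://old.example.com/about", "status": 301, "title": ""},
]

def archive_crawl(rows, path):
    """Write the crawl data to a CSV file so it survives the migration."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "status", "title"])
        writer.writeheader()
        writer.writerows(rows)

archive_crawl(crawl_rows, "legacy-crawl-backup.csv")
```

A plain CSV is deliberately low-tech: it stays readable years later without the crawler that produced it.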

How to identify indexable pages.
Once the crawl is complete, work on identifying the indexable pages of the legacy site. These are HTML pages with the following characteristics:

They return a 200 server response.
They have either no canonical tag or a self-referencing canonical URL.
Their meta robots tag is not set to noindex.
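The three checks above can be sketched as a simple filter over the crawl data. This is a simplified illustration (real audits would also consider X-Robots-Tag response headers, redirect chains, and canonicals pointing off-site):

```python
def is_indexable(status, meta_robots, url, canonical):
    """Apply the three indexability checks: 200 response, no noindex,
    and a canonical that is either absent or self-referencing."""
    if status != 200:
        return False
    if "noindex" in meta_robots.lower():
        return False
    # Canonical must be absent (None) or point back to the page itself.
    return canonical is None or canonical == url

print(is_indexable(200, "index,follow", "https://example.com/a", "https://example.com/a"))  # True
print(is_indexable(200, "noindex", "https://example.com/b", None))                          # False
print(is_indexable(301, "", "https://example.com/c", None))                                 # False
```

Running this over every crawled URL gives you the legacy site's indexable page set, which becomes the baseline for redirect mapping.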