How to Prepare Robots.txt and Sitemap?

The robots.txt file and XML sitemap are two essential technical SEO files that help search engines understand how to crawl your website and which pages they should discover. Robots.txt tells bots such as Googlebot which areas they may or may not crawl; a sitemap, also known as an XML sitemap, lists your important URLs, update dates, and site structure for search engines. In short: robots.txt guides crawling, while a sitemap speeds up discovery. A properly configured robots.txt and sitemap setup can significantly improve indexing efficiency, especially for new websites, e-commerce stores, corporate sites, and large content archives.

In this guide, we’ll walk through how to create a robots.txt file and an XML sitemap, which rules to use, what to watch out for on WordPress and custom-built websites, how to test for errors, and how to submit your sitemap to Google. Prepared for the Hostragons blog, this article is designed around 2026 SEO expectations, with a focus on search intent, technical accuracy, crawl budget, indexability, and practical implementation.

What Is Robots.txt?

Robots.txt is a plain-text file located in the root directory of your website. It is usually accessible at https://yourdomain.com/robots.txt. This file gives search engine bots instructions about which folders or pages can be crawled and which should not be crawled. The key point to understand is this: robots.txt is not a security tool. It is simply a crawl instruction for well-behaved bots.

For example, admin panels, cart steps, filter parameters, internal search results pages, or test directories can be blocked from search engine crawling. However, private information should never be “protected” with robots.txt. The file is public and can be viewed by anyone. Real security requires password protection, server-side access restrictions, secure hosting configuration, and SSL. For the basic security of your website, you can review SSL Certificate, and for a fast, reliable infrastructure, you can consider Web Hosting solutions.

What Does a Robots.txt File Do?

It guides how search engine bots crawl your website.
It helps reduce crawling of low-value or duplicate pages.
It helps reserve crawl budget for important pages.
It tells bots where your sitemap file is located.
It can prevent crawling of test areas, admin panels, internal search pages, and parameter-based URLs.

On websites with thousands of products, categories, tags, or filter pages, a poorly planned robots.txt file can cause Google to discover important pages later than expected. On the other hand, if your robots.txt file is too restrictive, it may block CSS, JavaScript, image files, or category pages, which can hurt your ranking performance.

What Is a Sitemap?

A sitemap, more precisely an XML sitemap, is a file that lists the important URLs on your website for search engines. It is commonly found at https://yourdomain.com/sitemap.xml. A sitemap sends this message to search engines: “These pages matter to my website. Please discover them and include the suitable ones in your indexing process.”

A sitemap file may include information such as the URL, last modified date, change frequency, and priority. Of these, the last modified date matters most in the 2026 SEO landscape: Google has stated that it ignores the <changefreq> and <priority> values and uses <lastmod> only when it is consistently accurate. Search engines want to discover fresh, high-quality content more efficiently. Still, a sitemap does not guarantee indexing on its own. Just because a URL appears in a sitemap does not mean it will definitely be listed in Google. The page must also be valuable, accessible, indexable, canonically correct, and aligned with user intent.

When Do You Need a Sitemap?

When you have launched a new website.
When your site has many pages, products, or blog posts.
When your internal linking structure is weak.
When you publish many images, videos, or news-style content.
When your e-commerce site frequently updates products.
When you regularly refresh older content.

Even on a small website with a clean internal linking structure, using a sitemap is considered a best practice. It gives search engines a clear URL list and reduces the chance of discovery delays.

Differences Between Robots.txt and Sitemap

Although robots.txt and sitemap files work together, they have different jobs. Robots.txt is mainly about crawl permissions and restrictions, while a sitemap lists the URLs you want search engines to discover. The table below summarizes the core differences.

Feature	Robots.txt	Sitemap
Main purpose	To guide which areas bots should crawl	To inform search engines about important URLs
File location	Root directory: /robots.txt	Usually /sitemap.xml
Format	Plain text	XML
Does it guarantee indexing?	No	No
Risk of incorrect use	Can block important pages from crawling	Can submit low-quality or noindex pages
SEO impact	Helps manage crawl budget	Improves URL discovery and update signals

How to Create a Robots.txt File

Creating a robots.txt file is technically simple, but it requires SEO awareness. The file must be named robots.txt, in lowercase, and uploaded to the root directory of the website. In other words, the correct address is https://yourdomain.com/robots.txt. Crawlers only request the file from the root of a host, so a robots.txt file uploaded to a subfolder is ignored, and each subdomain needs its own file.

1. Create the Basic Robots.txt Structure

The simplest structure allows all bots to crawl the website and also provides the sitemap location:

User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

Here, User-agent: * refers to all bots. Allow: / permits crawling of the entire site. The Sitemap line tells bots where the sitemap is located. For a newly launched website that you want indexed, this structure is usually a safe starting point.

2. Identify Areas You Do Not Want Crawled

Not every page needs to be crawled. Pages that are user-specific, temporary, duplicate, or low in SEO value can often be restricted with robots.txt. For example:

User-agent: *
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /search/
Disallow: /test/

On WordPress websites, blocking the /wp-admin/ folder from crawling is common. However, themes and plugins often load front-end content through /wp-admin/admin-ajax.php, so crawlers need access to that file to render pages correctly. For that reason, a WordPress robots.txt example may look like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://yourdomain.com/sitemap.xml

In this example, the admin panel is blocked from crawling while AJAX processes required by themes and plugins remain accessible. To keep your WordPress site fast and stable, you can also explore WordPress Hosting services.
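The Allow line works even though it comes after the broader Disallow line because Google, following the Robots Exclusion Protocol (RFC 9309), applies the most specific matching rule, meaning the longest matching path, and lets Allow win a tie. The following Python sketch illustrates that precedence logic only. The function name is_allowed and the rule-list format are illustrative assumptions, and wildcard characters (* and $) are deliberately left out.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs for one user-agent group,
    for example ("disallow", "/wp-admin/"). The longest matching prefix wins,
    and Allow wins a tie. Wildcards (* and $) are not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is crawlable
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The WordPress rules from the example above, as (directive, prefix) pairs.
wordpress_rules = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(wordpress_rules, "/wp-admin/options.php"))     # False
print(is_allowed(wordpress_rules, "/wp-admin/admin-ajax.php"))  # True
print(is_allowed(wordpress_rules, "/blog/hello-world/"))        # True
```

Because the decision depends on rule length rather than rule order, swapping the two lines in the file would not change the result for crawlers that follow this precedence.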

3. Control Parameters and Filters for E-Commerce Sites

On e-commerce websites, filters and sorting options such as color, size, price range, stock status, and search parameters can generate a large number of URLs. For example, the same category can multiply into variations such as /shoes?color=black, /shoes?size=42, or /shoes?sort=price_asc. If this structure is not controlled, Googlebot may spend time crawling thousands of low-value parameter URLs.

For these areas, robots.txt, canonical tags, and Google Search Console data should be evaluated together. Blocking every parameter with robots.txt is not always the right solution. Some filtered pages may match strong commercial search intent. For instance, a category such as “black men’s running shoes” may have SEO value and should be planned as a separate indexable category page rather than treated as a throwaway filter URL.
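Before deciding which parameters to restrict, it helps to see which ones actually generate the most URLs. The Python sketch below counts how many URLs use each query parameter name. It assumes you already have a list of crawled URLs, for example exported from a site crawler or server logs; the sample list and the function name count_parameter_keys are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def count_parameter_keys(urls):
    """Count how many URLs use each query parameter name."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # keep_blank_values=True so "?sort=" still counts as using "sort"
        keys = {key for key, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(keys)
    return counts


# Hypothetical sample; in practice this comes from a crawl or log export.
crawled_urls = [
    "https://yourdomain.com/shoes",
    "https://yourdomain.com/shoes?color=black",
    "https://yourdomain.com/shoes?size=42",
    "https://yourdomain.com/shoes?sort=price_asc",
    "https://yourdomain.com/shoes?color=black&sort=price_asc",
]

for key, total in count_parameter_keys(crawled_urls).most_common():
    print(key, total)
```

Parameters that appear on a large share of URLs but carry no search demand, such as sorting options, are the usual candidates for crawl restrictions, while parameters tied to real queries deserve a closer look first.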

4. Do Not Block CSS and JavaScript Files

In modern SEO, Google does not evaluate pages only as raw HTML; it also looks at the rendered version of the page. Blocking CSS and JavaScript files can make it harder for Google to understand the page layout, mobile usability, menus, or content loading behavior. Broad rules such as Disallow: /assets/ or Disallow: /js/, which were used more casually in the past, are risky today.

The safer approach for 2026 is this: CSS, JavaScript, images, and font files that shape the user experience should be accessible to bots. Only directories that truly do not need to be crawled, such as admin, temporary, or private areas, should be restricted.

5. Test Your Robots.txt File

After uploading the file, always test it. Here are the key checks to perform:

Does https://yourdomain.com/robots.txt open with a 200 status code?
Is the file empty, incorrect, or pointing to the wrong domain?
Does the Sitemap line show the correct URL?
Are important category, product, service, and blog pages accidentally blocked?
Are CSS, JavaScript, and image resources mistakenly restricted?

You can use the URL Inspection tool in Google Search Console to check whether important pages are crawlable. Analyzing server logs to see which URLs Googlebot visits is also an advanced but highly valuable method. For stronger server performance and proper configuration, VPS Server or Corporate Hosting options may be worth considering.
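Some of the checks above can be approximated locally before you rely on Search Console. The sketch below uses Python's standard urllib.robotparser module to test a list of important URLs against robots.txt content and to read the Sitemap lines. Treat it as a first-pass check: this parser applies rules in the order they appear in the file rather than by longest match, so results can differ from Google's when Allow and Disallow rules overlap. The sample rules and URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt, important_urls, user_agent="Googlebot"):
    """Return (blocked_urls, sitemap_urls) for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [u for u in important_urls if not parser.can_fetch(user_agent, u)]
    sitemaps = parser.site_maps() or []  # site_maps() requires Python 3.8+
    return blocked, sitemaps


# Illustrative robots.txt content; paste your own file's text here.
robots_txt = """\
User-agent: *
Disallow: /cart/
Disallow: /search/
Sitemap: https://yourdomain.com/sitemap.xml
"""

important = [
    "https://yourdomain.com/services/",
    "https://yourdomain.com/blog/first-post/",
    "https://yourdomain.com/assets/site.css",
    "https://yourdomain.com/search/?q=hosting",
]

blocked, sitemaps = check_robots(robots_txt, important)
print("Blocked:", blocked)
print("Sitemaps:", sitemaps)
```

Any URL that appears in the blocked list but should rank, including CSS and JavaScript resources, points to a rule that needs to be narrowed.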

How to Create a Sitemap

When preparing a sitemap, the goal is to give search engines a clean list of high-quality URLs that you want indexed. Not every URL must be included in the sitemap. In fact, adding noindex, redirected, error-generating, or duplicate pages to your sitemap can send poor technical SEO signals.

1. Add Only Indexable URLs

The pages you add to your sitemap should meet these criteria:

They should return a 200 status code.
They should not contain a noindex tag.
They should not be blocked by robots.txt.
Their canonical tag should point to themselves. If a page’s canonical points to a different URL, list that target URL in the sitemap instead.
They should contain original content that provides value to users.
They should be mobile-friendly and load quickly.

For example, deleted product pages, permanently discontinued out-of-stock products, internal search results, cart pages, and checkout pages should not appear in the sitemap. On the other hand, main category pages, important subcategories, service pages, blog posts, and active products should be included.
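The technical criteria above can be expressed as a simple filter. The following Python sketch assumes you already have per-page crawl data; the Page fields and the function name is_sitemap_eligible are hypothetical. It treats only self-canonical pages as eligible, which is a deliberately strict reading, and it does not try to judge content quality or page speed.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    status: int              # HTTP status code returned by the page
    noindex: bool            # True if the page carries a noindex directive
    canonical: str           # URL in the page's canonical tag
    blocked_by_robots: bool  # True if robots.txt disallows the URL


def is_sitemap_eligible(page):
    """Technical checks only: 200, indexable, crawlable, self-canonical."""
    return (
        page.status == 200
        and not page.noindex
        and not page.blocked_by_robots
        and page.canonical == page.url
    )


# Hypothetical crawl results for three pages.
pages = [
    Page("https://yourdomain.com/services/", 200, False,
         "https://yourdomain.com/services/", False),
    Page("https://yourdomain.com/cart/", 200, True,
         "https://yourdomain.com/cart/", True),
    Page("https://yourdomain.com/old-product/", 404, False,
         "https://yourdomain.com/old-product/", False),
]

sitemap_urls = [p.url for p in pages if is_sitemap_eligible(p)]
print(sitemap_urls)
```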

2. Use the XML Sitemap Format Correctly

A basic XML sitemap structure follows this logic:

<urlset> is the main container.
<url> is a separate block for each page.
<loc> contains the full URL of the page.
<lastmod> specifies the date when the page was last updated.

A sample URL entry looks like this: <loc>https://yourdomain.com/services/</loc> followed by <lastmod>2026-01-15</lastmod>, both placed inside a <url> block. The recommended date format is year-month-day (YYYY-MM-DD, the W3C Datetime format). It is important to update the lastmod field automatically and accurately. Changing the date of every URL every day just to “ping” Google is not a trustworthy practice.
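To show how these elements fit together in a complete file, here is a minimal Python sketch that builds a sitemap with the standard xml.etree.ElementTree module. The namespace is the one defined by the sitemaps.org protocol; the function name build_sitemap and the input format of (url, lastmod) pairs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, lastmod) pairs; lastmod is YYYY-MM-DD."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    # ElementTree escapes characters such as & in URLs automatically.
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


xml_text = build_sitemap([
    ("https://yourdomain.com/services/", "2026-01-15"),
])
print(xml_text)
```

Generating the file through an XML library rather than string concatenation avoids the most common validity problem, which is unescaped ampersands in URLs with query parameters.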

3. Split Sitemaps into Sections on Large Sites

A standard XML sitemap file should contain no more than 50,000 URLs and should not exceed 50 MB uncompressed. For large websites, using a sitemap index instead of a single sitemap is the healthier approach. For example:

/post-sitemap.xml
/page-sitemap.xml
/product-sitemap.xml
/category-sitemap.xml
/image-sitemap.xml

This structure helps search engines process files more efficiently and makes it easier to diagnose indexing problems by content type. For instance, if only 8,000 out of 20,000 URLs in the product sitemap are indexed, product descriptions, stock status, duplicate content, page speed, or filter structure should be reviewed separately.
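A sitemap index is itself a small XML file that points to each child sitemap. The Python sketch below splits a long URL list into chunks of at most 50,000 and builds an index for the resulting files. The child file names (product-sitemap-1.xml and so on) and the function names are hypothetical, and the sketch does not check the 50 MB size limit, which would need a separate check on the generated files.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a URL list into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Build a <sitemapindex> document that points at each child sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical catalog of 120,000 product URLs.
product_urls = [f"https://yourdomain.com/product/{n}/" for n in range(120_000)]
chunks = chunk_urls(product_urls)
children = [
    f"https://yourdomain.com/product-sitemap-{i}.xml"
    for i in range(1, len(chunks) + 1)
]
print(len(chunks))  # 3
print(build_sitemap_index(children))
```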

4. Creating a Sitemap in WordPress

WordPress versions 5.5 and later include a built-in XML sitemap feature. By default, it is usually available at /wp-sitemap.xml. However, in many professional projects, SEO plugins such as Rank Math, Yoast SEO, or similar tools are preferred because they provide more advanced sitemap control. With these plugins, you can decide which content types should be included in the sitemap, whether tag archives should appear, and how author archives should be managed.

A common mistake on WordPress websites is adding low-value tag pages to the sitemap. If tag pages do not have unique descriptions, strong internal links, and real search demand, it is often better to keep them out of the sitemap. To strengthen your content strategy, you can also read How to Write an SEO Compatible Blog Post.

5. Set Up Sitemap Automation for Custom-Built Websites

On custom-developed websites, a sitemap can be created manually, but dynamic projects need automated generation. When a product is added, a blog post is published, or a service page is updated, the sitemap should update automatically as well. Your development team should apply the following rules:

Published pages should be added to the sitemap automatically.
Deleted URLs or URLs returning 404 should be removed from the sitemap.
Pages marked noindex should not be included in the sitemap.
Pages whose canonical tag points to a different URL should generally be left out, with the canonical target listed instead.
Lastmod should update only when there is a real content change.

This automation is critical for technical SEO health, especially on frequently updated news, classifieds, booking, education, and e-commerce projects.
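One way to make sure lastmod changes only on a real content change is to store a fingerprint of each page's main content and compare it on every build. The Python sketch below shows the idea under simple assumptions: the function names are hypothetical, the stored hash and date would come from your own database, and the text passed in should be the main content only, so that template or footer changes do not move the date.

```python
import hashlib
from datetime import date


def content_fingerprint(text):
    """Hash the main content so unchanged pages are detected reliably."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def next_lastmod(stored_hash, stored_lastmod, current_text, today):
    """Return (hash, lastmod); lastmod moves only when the content changed."""
    new_hash = content_fingerprint(current_text)
    if new_hash == stored_hash:
        return stored_hash, stored_lastmod
    return new_hash, today.isoformat()


# Hypothetical stored state for one page.
old_text = "Our hosting plans start at 10 GB."
old_hash = content_fingerprint(old_text)
build_day = date(2026, 2, 1)

unchanged = next_lastmod(old_hash, "2026-01-15", old_text, build_day)
changed = next_lastmod(old_hash, "2026-01-15",
                       "Our hosting plans start at 20 GB.", build_day)
print(unchanged[1])  # 2026-01-15
print(changed[1])    # 2026-02-01
```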

How to Include a Sitemap in Robots.txt

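Adding the sitemap URL to the robots.txt file is a good practice. The Sitemap line is independent of the User-agent groups, so it can appear anywhere in the file, although placing it at the bottom keeps the file readable. It must be a full absolute URL. This helps bots find your sitemap easily. Example usage: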

User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

If you have multiple sitemap files, you can list each one on a separate line:

Sitemap: https://yourdomain.com/post-sitemap.xml
Sitemap: https://yourdomain.com/product-sitemap.xml
Sitemap: https://yourdomain.com/category-sitemap.xml

If your domain uses HTTPS, your sitemap URLs should also use HTTPS. HTTP, www, and non-www variations should not be mixed. That is why your domain, SSL, and redirect structure should be planned correctly from the beginning. If you are launching a new project, consider Domain Lookup and SSL Certificate together with your technical SEO plan.
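Mixed protocol or host variations are easy to detect automatically. The following Python sketch flags any URL whose scheme or host differs from the primary format. The function name find_mismatched_urls and the sample list are illustrative; set the primary argument to whichever format your site actually uses.

```python
from urllib.parse import urlsplit


def find_mismatched_urls(urls, primary="https://yourdomain.com"):
    """Return URLs whose scheme or host differs from the primary origin."""
    expected = urlsplit(primary)
    expected_origin = (expected.scheme, expected.netloc.lower())
    mismatched = []
    for url in urls:
        parts = urlsplit(url)
        if (parts.scheme, parts.netloc.lower()) != expected_origin:
            mismatched.append(url)
    return mismatched


# Hypothetical sitemap entries with two inconsistent variations.
sitemap_entries = [
    "https://yourdomain.com/services/",
    "http://yourdomain.com/blog/",
    "https://www.yourdomain.com/contact/",
]

print(find_mismatched_urls(sitemap_entries))
```

The same check can be run against the Sitemap lines in robots.txt and against canonical tag values, since all of them should point to one URL format.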

Submitting a Sitemap to Google Search Console

After creating your sitemap, you should submit it through Google Search Console. The steps are as follows:

Sign in to Google Search Console.
Select the correct property. A domain property is usually preferred.
Go to the Sitemaps section in the left-hand menu.
Enter the sitemap URL, for example sitemap.xml.
Click the Submit button.
Check the Status area for the Success message and the number of discovered URLs.

Do not expect all pages to be indexed immediately after submitting a sitemap. Google first discovers URLs, then crawls and processes them, and finally decides whether to index them based on quality signals. For new websites, this process can take anywhere from a few days to several weeks. Strong internal linking, high-quality content, and fast server response times can help the process move more smoothly.

Common Robots.txt and Sitemap Mistakes

1. Accidentally Blocking the Entire Site

The most critical mistake is leaving the Disallow: / rule on a live website. This rule prevents the entire site from being crawled. If this setting is used in a development environment and not removed before launch, Google will not be able to crawl any page on the live site. Robots.txt should always be part of your go-live checklist.
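A go-live checklist item like this can be turned into an automated check. The Python sketch below uses the standard urllib.robotparser module to detect whether the root path is disallowed, which under prefix matching means every path is blocked. The function name blocks_entire_site is hypothetical, and how you load the robots.txt text, for example from your deployment artifact, is left as an assumption.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt, user_agent="Googlebot"):
    """Return True if the root path "/" is disallowed for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


staging_robots = "User-agent: *\nDisallow: /\n"
live_robots = "User-agent: *\nDisallow: /wp-admin/\n"

print(blocks_entire_site(staging_robots))  # True
print(blocks_entire_site(live_robots))     # False
```

Running this against the production file as part of a deployment pipeline, and failing the release when it returns True, is one way to keep a staging rule from reaching the live site.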

2. Adding Noindex Pages to the Sitemap

Giving a page a noindex directive and adding the same page to the sitemap creates a conflicting signal. The sitemap says, “this page is important,” while noindex says, “do not include this page in the index.” For this reason, your sitemap should contain only the URLs you actually want indexed.

3. Keeping 301, 404, or 500 URLs in the Sitemap

URLs inside the sitemap should ideally return a 200 status code. Redirected URLs, not found pages, or URLs that return server errors should be cleaned up regularly. Running a monthly technical SEO crawl helps you catch these issues early.
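A regular audit of this kind can be scripted. The Python sketch below extracts every <loc> value from a sitemap and reports the URLs that do not return 200. To keep the example self-contained, the status lookup is passed in as a function and backed by a hypothetical dictionary; in a real audit it would be replaced by an HTTP request that does not follow redirects, so that 301 responses stay visible.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(sitemap_xml):
    """Return every <loc> value from a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


def find_non_200(urls, get_status):
    """Return (url, status) pairs for URLs that do not answer with 200."""
    results = ((url, get_status(url)) for url in urls)
    return [(url, status) for url, status in results if status != 200]


sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourdomain.com/services/</loc></url>
  <url><loc>https://yourdomain.com/old-page/</loc></url>
  <url><loc>https://yourdomain.com/moved/</loc></url>
</urlset>"""

# Hypothetical status codes standing in for real HTTP responses.
fake_statuses = {
    "https://yourdomain.com/services/": 200,
    "https://yourdomain.com/old-page/": 404,
    "https://yourdomain.com/moved/": 301,
}

print(find_non_200(extract_locs(sitemap_xml), fake_statuses.get))
```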

4. Using the Wrong Domain or Protocol

If you use https://www.yourdomain.com, the URLs inside your sitemap should follow the same format. Mixing different protocol or domain variations can make it harder for Google to consolidate signals. Therefore, canonical tags, sitemap URLs, robots.txt, and redirects should all point to the same primary URL format.

5. Submitting Too Many URLs

A sitemap is not a dumping ground. Instead of adding every URL, include the high-quality pages you truly want indexed. Keeping low-quality, duplicate, or thin pages out of the sitemap sends search engines a cleaner signal.

Technical SEO Checklist for 2026

Use the following checklist when preparing robots.txt and sitemap files:

Is robots.txt in the root directory and accessible?
Is the sitemap URL correctly listed inside robots.txt?
Are important pages not blocked by robots.txt?
Are CSS, JavaScript, and image resources crawlable?
Does the sitemap include only indexable URLs returning 200 status codes?
Are noindex pages excluded from the sitemap?
Do lastmod dates reflect real updates?
Is a sitemap index used on large websites?
Has the sitemap been processed successfully in Google Search Console?
Do server response times support efficient crawling?

Technical SEO is not limited to creating files. Hosting performance, SSL configuration, DNS accuracy, redirects, mobile usability, and content quality also have a direct impact. That is why it is useful to evaluate Hosting Packages, Domain Transfer, and Website Security together when planning your website infrastructure.

Example Robots.txt and Sitemap Strategy

For a simple business website, the recommended structure might be as follows: the homepage, service pages, about page, contact page, and blog posts appear in the sitemap. The admin panel, form thank-you pages, temporary campaign tests, and internal search results are managed with robots.txt or noindex. On this type of website, the sitemap usually contains between 20 and 200 URLs.

For a mid-sized e-commerce site, product, category, brand, and blog sitemaps can be kept separate. Active products are added to the sitemap, permanently removed products are taken out, and 301 redirects are created to similar products where appropriate. Filter URLs are analyzed one by one. Filters with search volume and conversion potential are structured as dedicated categories, while the rest are controlled through robots.txt, canonical tags, or a noindex strategy.

For a content-heavy blog or news website, publication dates, update dates, category structure, and internal linking are extremely important. When older content is refreshed, lastmod should change accurately, but artificial date updates should be avoided. The signal Google trusts is real content improvement.

Frequently Asked Questions

Does robots.txt completely prevent indexing?

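No. Robots.txt blocks crawling; it does not always fully prevent indexing. If a URL is linked from other websites, Google may still show that URL in the index without crawling it. To prevent indexing, you generally need a noindex tag or an appropriate access restriction. Note that a noindex tag only works if the page is crawlable: if robots.txt blocks the URL, Google never fetches the page and never sees the tag.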

Does a sitemap help a website rank higher on Google?

A sitemap does not directly guarantee better rankings. However, it helps important pages get discovered faster, communicates updates to search engines, and improves technical SEO health. Rankings also depend on content quality, links, user experience, speed, and trust signals.

Is it mandatory to include the sitemap in robots.txt?

No, it is not mandatory, but it is recommended. Adding the sitemap URL to robots.txt makes it easier for search engines to find your sitemap. Submitting the sitemap through Google Search Console is also a good practice.

What is the WordPress sitemap URL?

The default WordPress sitemap URL is usually /wp-sitemap.xml. If you use SEO plugins, the sitemap address may be /sitemap_index.xml or /sitemap.xml. You should check the exact address based on the plugin you use.

How many URLs can a sitemap contain?

A single XML sitemap file should contain no more than 50,000 URLs and should not exceed 50 MB uncompressed. For larger sites, the best approach is to use a sitemap index and split content into separate files such as pages, posts, products, categories, or images.

Conclusion

Robots.txt and sitemap files are two core parts of technical SEO that may look small but can have a major impact. Robots.txt guides bot crawling behavior, while the sitemap makes important URLs easier to discover. For a correct setup, keep important pages open, restrict unnecessary areas in a controlled way, add only indexable URLs to your sitemap, and monitor performance regularly through Google Search Console.

If you want to build a strong technical foundation for your website, starting with reliable hosting, proper domain management, and SSL configuration is a smart move. By exploring Hostragons’ Web Hosting, Domain, and SSL Certificate solutions, you can create a fast, secure, and SEO-friendly infrastructure for your site.
