How to Create a Robots.txt File | How to Implement a robots.txt File

Robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users.



In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.
Basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]

Example robots.txt:

Here are a few examples of robots.txt in action for a www.example.com site:
Robots.txt file URL: www.example.com/robots.txt
Blocking all web crawlers from all content
User-agent: * 
Disallow: /
Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages on www.example.com, including the homepage.
Allowing all web crawlers access to all content
User-agent: * 
Disallow: 
Using this syntax in a robots.txt file tells web crawlers to crawl all pages on www.example.com, including the homepage.
Blocking a specific web crawler from a specific folder
User-agent: Googlebot 
Disallow: /example-subfolder/
This syntax tells only Google’s crawler (user-agent name Googlebot) not to crawl any pages that contain the URL string www.example.com/example-subfolder/.
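
These directives can also be combined in one file, with a blank-line-separated group per user agent. Below is a minimal sketch using a hypothetical /private/ folder: it blocks /private/ for every crawler and adds a stricter rule just for Googlebot. Note that a crawler obeys only the most specific group that matches it, so any rule that should also apply to Googlebot must be repeated inside the Googlebot group:

# All crawlers: stay out of /private/ (hypothetical path)
User-agent: *
Disallow: /private/

# Googlebot obeys only this group, so /private/ is repeated here
User-agent: Googlebot
Disallow: /private/
Disallow: /example-subfolder/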

Other quick robots.txt must-knows (discussed in more detail below):
  • In order to be found, a robots.txt file must be placed in a website’s top-level directory.
  • Robots.txt is case sensitive: the file must be named “robots.txt” (not Robots.txt, robots.TXT, or otherwise).
  • Some user agents (robots) may choose to ignore your robots.txt file. This is especially common with more nefarious crawlers like malware robots or email address scrapers.
  • The /robots.txt file is publicly available: just add /robots.txt to the end of any root domain to see that website’s directives (if that site has a robots.txt file!). This means that anyone can see what pages you do or don’t want to be crawled, so don’t use this file to hide private user information.
  • It’s generally a best practice to indicate the location of any sitemaps associated with this domain at the bottom of the robots.txt file. Here’s an example:
Sitemap: https://www.example.com/sitemap.xml

Technical robots.txt syntax

Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots file. They include:
  • User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). Lists of common user agents are available online.
  • Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.
  • Allow (only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed (see the combined sketch after this list).
  • Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
  • Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
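
Putting the five terms together, a complete robots.txt file might look like the sketch below. The paths and sitemap URL are hypothetical placeholders, not recommendations:

# Googlebot may fetch one page inside an otherwise blocked subfolder;
# for Google, the more specific (longer) Allow rule wins over Disallow
User-agent: Googlebot
Allow: /example-subfolder/public-page.html
Disallow: /example-subfolder/

# All other crawlers: wait 10 seconds between requests, skip /private/
User-agent: *
Crawl-delay: 10
Disallow: /private/

# Sitemap location, given as a full URL at the bottom of the file
Sitemap: https://www.example.com/sitemap.xml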
