The robots.txt file is a small file located in the root folder of your Joomla site. The file contains instructions to the search engines on what to index and what to leave out.
Rules are only for Disallow
The robots.txt file is part of the Robots EXCLUSION Protocol, and as such "Allow" is NOT a part of the original standard syntax. Robots.txt is all about what to DISALLOW. Bad bots, of course, ignore the robots.txt file, so it is no protection against misbehaving bots or hacking scripts. Other methods, such as the .htaccess file, should be employed for that.
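A typical Joomla robots.txt is just a list of Disallow rules under a User-agent line. A minimal excerpt might look like this (the exact folder list varies by Joomla version):

```text
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /logs/
Disallow: /tmp/
```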
Robots.txt is NOT about security!
Most bad bots will NOT respect the robots.txt file, so again, robots.txt is not about security. It is about telling nice, well-behaving search bots what they can and cannot crawl and index on your site.
Some people confuse the robots.txt file with the .htaccess file. The difference is significant. The robots.txt file only gives instructions to search engines - and most search engines respect it. The .htaccess file, on the other hand, is used to reconfigure the settings of your Apache server, redirect URLs and perform other server-related tasks.
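For example, a single line in a .htaccess file can redirect a URL at the server level - something robots.txt cannot do. A small sketch (the paths are placeholders, and Apache's mod_alias must be enabled):

```apache
# Permanent (301) redirect, enforced by the server itself -
# unlike robots.txt, bots cannot simply ignore it
Redirect 301 /old-page.html /new-page.html
```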
The robots.txt file and SEO
As mentioned, the robots.txt file sits in your site root folder. It contains information on which folders should and should not be indexed. It can also include a reference to your XML sitemap.
1. Remove exclusion of certain images folders
By default the robots.txt file in Joomla excludes the images folder. If you want to have some images indexed by the search engines (for example images included in articles), it is a good plan to put them into a specific folder, e.g. /images/article-images and then to allow crawling of ONLY that subfolder.
To make this effective, open your robots.txt file, remove the blanket Disallow rule for the /images/ folder, and then add a Disallow line for every subfolder that you do NOT want indexed:
Disallow: /images/subfolder-a
Disallow: /images/subfolder-b
Disallow: /images/subfolder-c
...and so on, one line per folder.
So, by NOT including a Disallow rule for the /images/article-images subfolder, Google and others will start indexing your article-images on the next crawl of your site, but leave out the disallowed subfolders.
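Putting it together, the relevant part of the robots.txt file might look like this (the subfolder names are placeholders for your own):

```text
User-agent: *
# /images/ itself is no longer disallowed
Disallow: /images/subfolder-a
Disallow: /images/subfolder-b
# no rule for /images/article-images, so it may be crawled
```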
2. Add a reference to your sitemap.xml file
If you have a sitemap.xml file (and you should have!), include the following line in your robots.txt file:
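The directive takes a full, absolute URL. Its generic form looks like this (www.example.com is a placeholder):

```text
Sitemap: https://www.example.com/sitemap.xml
```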
Naturally, this line needs to be adjusted to fit your domain and sitemap file. In my case, I use the Xmap component to create the sitemap XML file automatically, so the Sitemap line points at the URL that Xmap generates.
Remove the trailing slashes from the folders if you want to completely disallow those folders from being crawled - as long as you don't have any files or pages whose names begin with the same text!
Example: "GET /administrator" would not be stopped by
Disallow: /administrator/
because robots.txt is based on 'prefix match' beginning at the root "/", and the path /administrator does not start with the prefix /administrator/.
Disallow: /admin
on the other hand would disallow any URL with the prefix /admin, including:
/admin
/administrator/
/admin.html
It simply matches the prefix of the URL beginning at the root.
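You can verify this prefix-match behaviour with Python's standard-library robots.txt parser, urllib.robotparser. A quick sketch (example.com is a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# Rule WITH a trailing slash: blocks only paths under the folder.
with_slash = RobotFileParser()
with_slash.parse([
    "User-agent: *",
    "Disallow: /administrator/",
])
# "/administrator" does not start with the prefix "/administrator/",
# so the bare URL is still allowed.
allowed = with_slash.can_fetch("*", "https://example.com/administrator")

# Rule WITHOUT a trailing slash: plain prefix match from the root.
no_slash = RobotFileParser()
no_slash.parse([
    "User-agent: *",
    "Disallow: /admin",
])
blocked_folder = no_slash.can_fetch("*", "https://example.com/administrator/index.php")
blocked_file = no_slash.can_fetch("*", "https://example.com/admin.html")

print(allowed, blocked_folder, blocked_file)  # True False False
```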
Again, this will not prevent any robot from access; it simply asks them to "please not go there". For real protection you may want to use server-side measures such as the .htaccess file or equivalent.
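For example, a .htaccess file placed inside a folder you want to protect can enforce access control at the server level. A sketch for Apache 2.4 (the IP address is a placeholder for your own):

```apache
# Deny everyone except one trusted address - the server enforces this,
# so it works even against bots that ignore robots.txt
Require ip 203.0.113.7
```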