There are many ways to tell Googlebot not to index pages:
Using the Allow directive in robots.txt
User-Agent: gsa-crawler
Disallow: /folder1/
Allow: /folder1/myfile.html
Using Robots META Tags to Control Access to a Web Page
Using .htaccess
User-agent: *
Disallow: /tralllala
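To see how the Allow and Disallow lines in the gsa-crawler example interact, here is a minimal Python sketch. It assumes the precedence Google documents for its own crawlers: the longest matching path wins, and Allow wins a tie. Whether the Google Search Appliance resolves conflicts exactly the same way is an assumption here, and the function name is_allowed and the RULES list are made up for illustration. Wildcards are not handled.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, prefix) tuples, for example
    ("Disallow", "/folder1/"). Assumed precedence: the longest matching
    prefix wins; on a tie, Allow wins. No wildcard support.
    """
    best_len = -1
    allowed = True  # nothing matched: crawling is allowed by default
    for directive, prefix in rules:
        # An empty prefix matches nothing; skip rules that do not apply.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The gsa-crawler rules from the example above.
RULES = [
    ("Disallow", "/folder1/"),
    ("Allow", "/folder1/myfile.html"),
]

for url_path in ("/folder1/myfile.html", "/folder1/other.html", "/index.html"):
    print(url_path, is_allowed(RULES, url_path))
```

Under these assumptions only /folder1/other.html is blocked: the Allow line is more specific than the Disallow line, so myfile.html stays crawlable.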
But how does it work if you want to keep only part of a page from being crawled?
Excluding unwanted text from the Google index is still not as easy as it should be.
Here is a video answer from Matt Cutts to this question:
Google still doesn’t offer any solution that lets us exclude parts of a page from being indexed, and using noindexed iframes is not a real solution either.
So how do you noindex parts of a web page?
It would be nice if there were a solution like the one Google offers for the Google Search Appliance, where you can exclude parts of a page from being indexed.
The Google Search Appliance supports “googleon” and “googleoff” tags, special proprietary HTML tags that can be embedded in the HTML of crawled documents to prevent searching of text between these special tags.
The googleoff/googleon tags disable the indexing of a part of a web page. The result is that those pages do not appear in search results when users search for the tagged word or phrase. For example, some customers use googleoff/googleon tags to comment out a navigation bar in static HTML pages.
You can use googleon/googleoff to tell the Google Search Appliance to ignore portions of a page. Insert <!--googleoff: index--> at the point where you want the Google Search Appliance to stop indexing, then insert <!--googleon: index--> where you want it to resume indexing the page.
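To get a feeling for what the appliance would skip, here is a small Python sketch that removes everything between the comment-style tags <!--googleoff: index--> and <!--googleon: index-->. It only imitates the idea and is not how the Google Search Appliance is implemented. The function name indexable_html is made up, the tolerance for extra whitespace inside the comments is my own choice, and treating an unclosed googleoff as running to the end of the page is an assumption.

```python
import re

# Matches from a googleoff comment to the next googleon comment.
# Assumption: if indexing is never switched back on, the excluded
# region runs to the end of the document.
_OFF_REGION = re.compile(
    r"<!--\s*googleoff:\s*index\s*-->.*?(?:<!--\s*googleon:\s*index\s*-->|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def indexable_html(html: str) -> str:
    """Return the HTML with all googleoff/googleon regions removed."""
    return _OFF_REGION.sub("", html)


page = (
    "<p>Main article text.</p>"
    "<!--googleoff: index-->"
    "<ul><li>Home</li><li>About</li></ul>"
    "<!--googleon: index-->"
    "<p>More article text.</p>"
)

print(indexable_html(page))
# prints: <p>Main article text.</p><p>More article text.</p>
```

The navigation list disappears while the article text on both sides stays, which is the navigation-bar use case described above.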
The googleon and googleoff tags (which you can see live on adobe.com) are ignored by the regular Google spiders and by other search engines. They only make sense in conjunction with the Google Search Appliance or, possibly, the Google Mini.
http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html
I think Google should offer these googleon and googleoff tags not just for the Google Search Appliance; they should work for the normal Googlebot too. This would help get web pages indexed more accurately.