To avoid undesirable content in the search indexes, webmasters can instruct spiders not to crawl certain files or directories through the standard robots.txt file in the root directory of the domain. Additionally, a page can be explicitly excluded from a search engine's database by using a meta tag specific to robots (usually <meta name="robots" content="noindex">). When a search engine visits a site, the robots.txt located in the root directory is the first file crawled. The robots.txt file is then parsed and will instruct the robot as to which pages are not to be crawled. As a search engine crawler may keep a cached copy of this file, it may on occasion crawl pages a webmaster does not wish crawled. Pages typically prevented from being crawled include login-specific pages such as shopping carts and user-specific content such as search results from internal searches. In March 2007, Google warned webmasters that they should prevent indexing of internal search results because those pages are considered search spam.
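As a rough illustration of how a crawler might apply these rules, the sketch below uses Python's standard urllib.robotparser module to parse a robots.txt file and check a few URLs against it. The rules, the "ExampleBot" user agent and the example.com URLs are all made up for the example; real crawlers may interpret robots.txt differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block the shopping cart and
# internal search results for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_crawlable(parser: RobotFileParser, url: str,
                 user_agent: str = "ExampleBot") -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in (
        "https://example.com/products/red-shoes",
        "https://example.com/cart/checkout",
        "https://example.com/search?q=shoes",
    ):
        print(url, "->", "crawl" if is_crawlable(rules, url) else "skip")
```

Note that this only covers crawling. Keeping an already-discovered page out of the index is the job of the robots meta tag mentioned above.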
Yes, you need to build links to your site to acquire more PageRank, or Google ‘juice’ – or what we now call domain authority or trust. Google is a link-based search engine – it does not quite understand ‘good’ or ‘quality’ content – but it does understand ‘popular’ content. It can also usually identify poor, or THIN CONTENT – and it penalises your site for that – or – at least – it takes away the traffic you once had with an algorithm change. Google doesn’t like calling the actions they take a ‘penalty’ – it doesn’t look good. They blame your ranking drops on their engineers getting better at identifying quality content or links, or the inverse – low-quality content and unnatural links. If they do take action against your site for paid links – they call this a ‘Manual Action’ and you will get notified about it in Webmaster Tools if you sign up.
If you want to *ENSURE* your FULL title tag shows in the desktop UK version of Google SERPs, stick to a shorter title of between 55 and 65 characters. That does not mean your title tag MUST end at 55 characters, and remember that your mobile visitors see a longer title (in the UK, in January 2018). What you see displayed in SERPs depends on the characters you use. In 2019 – I just expect what Google displays to change – so I don’t obsess about what Google is doing in terms of display. See the tests later on in this article.
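If you want a quick way to flag titles that run past that range, a minimal sketch could look like the one below. The function name and the 65-character default are assumptions taken from the range quoted above, and a plain character count is only a proxy, since what is displayed depends on the characters used.

```python
def fits_title_range(title: str, max_chars: int = 65) -> bool:
    """Return True if the title is no longer than max_chars characters.

    max_chars defaults to the upper end of the 55-65 range quoted in the
    text; it is a rule of thumb, not a limit published by Google.
    """
    return len(title.strip()) <= max_chars


def long_titles(titles: list[str], max_chars: int = 65) -> list[str]:
    """Return the titles that exceed max_chars, in their original order."""
    return [t for t in titles if not fits_title_range(t, max_chars)]


if __name__ == "__main__":
    # Hypothetical page titles used only for illustration.
    candidates = [
        "Red Running Shoes for Women | Example Store",
        "Buy the Best Price Red Womens Size 7 Running Shoes Online "
        "Today With Free Delivery | Example Store",
    ]
    for title in long_titles(candidates):
        print(len(title), "characters:", title)
```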
I do not obsess about site architecture as much as I used to… but I always ensure the pages I want indexed are all available from a crawl from the home page – and I still emphasise important pages by linking to them where relevant. I always aim to get THE most important exact match anchor text pointing to the page from internal links – but I avoid abusing internals and avoid overtly manipulative internal links that are not grammatically correct, for instance.
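One way to check that every page can be reached by a crawl from the home page is a breadth-first walk over your internal links. The sketch below assumes you already have the link graph as a dictionary mapping each page to the pages it links to; the SITE_LINKS data and the function names are hypothetical.

```python
from collections import deque

# Hypothetical internal link graph: page -> pages it links to.
SITE_LINKS = {
    "/": ["/services", "/blog"],
    "/services": ["/contact"],
    "/blog": ["/"],
    "/old-landing-page": ["/services"],  # nothing links TO this page
}


def reachable_from(home: str, links: dict[str, list[str]]) -> set[str]:
    """Return every page reachable by following links from home."""
    seen = {home}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def unreachable_pages(home: str, links: dict[str, list[str]]) -> list[str]:
    """Return known pages that a crawl starting at home would never find."""
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return sorted(all_pages - reachable_from(home, links))


if __name__ == "__main__":
    print(unreachable_pages("/", SITE_LINKS))
```

Any page this reports is one a crawler starting at the home page would have to discover some other way, for example through external links or a sitemap.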
QUOTE: “We are a health services comparison website…… so you can imagine that for the majority of those pages the content that will be presented in terms of the clinics that will be listed looking fairly similar right and the same I think holds true if you look at it from the location …… we’re conscious that this causes some kind of content duplication so the question is is this type … to worry about?”
Google will INDEX perhaps 1000s of characters in a title… but I don’t think anyone knows exactly how many characters or words Google will count AS a TITLE TAG when determining RELEVANCE OF A DOCUMENT for ranking purposes. It is a very hard thing to try to isolate accurately with all the testing and obfuscation Google uses to hide its ‘secret sauce’. I have had ranking success with longer titles – much longer titles. Google certainly reads ALL the words in your page title (unless you are spamming it silly, of course).
Don’t underestimate these less popular keywords. Long tail keywords with lower search volume often convert better, because searchers are more specific and intentional in their searches. For example, a person searching for "shoes" is probably just browsing. On the other hand, someone searching for "best price red womens size 7 running shoe" practically has their wallet out!
The reality in 2019 is that if Google classifies your duplicate content as THIN content, or MANIPULATIVE BOILER-PLATE or NEAR DUPLICATE ‘SPUN’ content, then you probably DO have a severe problem that violates Google’s website performance recommendations and this ‘violation’ will need to be ‘cleaned’ up – if – of course – you intend to rank high in Google.
Try to get links within page text pointing to your site with relevant, or at least natural-looking, keywords in the text link – not, for instance, in blogrolls or site-wide links. Try to ensure the links are not obviously “machine generated”, e.g. site-wide links on forums or directories. Get links from pages that, in turn, have a lot of links to them, and you will soon see benefits.
When would this be useful? If your site has a blog with public commenting turned on, links within those comments could pass your reputation to pages that you may not be comfortable vouching for. Blog comment areas on pages are highly susceptible to comment spam. Nofollowing these user-added links ensures that you're not giving your page's hard-earned reputation to a spammy site.
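As a sketch of what nofollowing user-added links can look like in practice, the example below rewrites a comment's HTML so that every link carries rel="nofollow". It uses Python's standard html.parser module; the NofollowRewriter class and nofollow_links function are hypothetical names, and most blog platforms already do this for you. It is not a sanitiser: it drops HTML comments and overwrites any existing rel value, but does nothing else to make untrusted HTML safe.

```python
from html import escape
from html.parser import HTMLParser


class NofollowRewriter(HTMLParser):
    """Re-emit HTML, forcing rel="nofollow" on every <a> tag."""

    def __init__(self) -> None:
        # Keep entities such as &amp; untouched so text round-trips as-is.
        super().__init__(convert_charrefs=False)
        self.parts: list[str] = []

    @staticmethod
    def _render(attrs) -> str:
        out = []
        for name, value in attrs:
            if value is None:  # attribute without a value
                out.append(f" {name}")
            else:
                out.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(out)

    def _with_nofollow(self, tag, attrs):
        if tag != "a":
            return attrs
        kept = [(n, v) for n, v in attrs if n != "rel"]
        return kept + [("rel", "nofollow")]

    def handle_starttag(self, tag, attrs):
        attrs = self._with_nofollow(tag, attrs)
        self.parts.append(f"<{tag}{self._render(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        attrs = self._with_nofollow(tag, attrs)
        self.parts.append(f"<{tag}{self._render(attrs)} />")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def nofollow_links(comment_html: str) -> str:
    """Return comment_html with rel="nofollow" set on all links."""
    rewriter = NofollowRewriter()
    rewriter.feed(comment_html)
    rewriter.close()
    return "".join(rewriter.parts)


if __name__ == "__main__":
    comment = 'Nice post! <a href="http://spam.example/">cheap pills</a>'
    print(nofollow_links(comment))
```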
“Sharability” – Not every single piece of content on your site will be linked to and shared hundreds of times. But in the same way you want to be careful of not rolling out large quantities of pages that have thin content, you want to consider who would be likely to share and link to new pages you’re creating on your site before you roll them out. Having large quantities of pages that aren’t likely to be shared or linked to doesn’t position those pages to rank well in search results, and doesn’t help to create a good picture of your site as a whole for search engines, either.
Domain authority is an important ranking phenomenon in Google. Nobody knows exactly how Google calculates, ranks and rates the popularity, reputation, intent or trust of a website, outside of Google, but when I write about domain authority I am generally thinking of sites that are popular, reputable and trusted – all of which can be faked, of course.
You may not want certain pages of your site crawled because they might not be useful to users if found in a search engine's search results. If you do want to prevent search engines from crawling your pages, Google Search Console has a friendly robots.txt generator to help you create this file. Note that if your site uses subdomains and you wish to have certain pages not crawled on a particular subdomain, you'll have to create a separate robots.txt file for that subdomain. For more information on robots.txt, we suggest this Webmaster Help Center guide on using robots.txt files.
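To make the subdomain point concrete, the short sketch below works out which robots.txt file governs a given page: it is always the one at the root of that page's own host. The function name and the example.com hosts are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain); drop path and query.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    for page in (
        "https://www.example.com/products/red-shoes?size=7",
        "https://blog.example.com/2019/title-tags",
    ):
        print(page, "->", robots_txt_url(page))
```

The two pages resolve to different files, which is why rules written for www.example.com do nothing for blog.example.com.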