TF-IDF, or term frequency-inverse document frequency, is in its simplest form a weighting system applied to the words and terms used in a document or web page.
Using this weighting method, we can estimate statistically how important a word or phrase is in a given document compared with a collection, or corpus, of similar documents.
It has been suggested on more than one occasion that search engines employ the TF-IDF weighting system to help them understand whether a web page is really about the topic it appears to be about. By analysing the other top results and seeing which words are commonly used on those pages, a search engine can check the page it is looking to rank against them.
The search engine can then not only be more confident that the page is a good match for that topic, but also assess how thoroughly the page covers it.
Isn't this just semantic words?
No. Semantic word matching is essentially about using words that mean the same thing without being written the same way. TF-IDF looks instead at the related terms that statistically tend to appear alongside a topic.
Let's take the phrase "software development". On pages that target this term, you may also commonly see words or phrases such as:
- Bespoke Platform
- Custom Built Portal
- PHP, C#, Java
- Company Solutions
Google, for example, may understand that if a web page seems to target the term "software development" then statistically it should mention at least some of these words or phrases too.
The frequency of words is also important
It's not just a case of throwing these words into your document haphazardly. This weighting system also (and probably more importantly) assesses how frequently these additional words and phrases appear in a document.
This frequency is what the novice SEO company usually overlooks. Search engines can note how often these words appear per x total words, averaged across the corpus, and when another site falls within that frequency it's a good bet that it is using them correctly.
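To make the idea of an average frequency across the corpus a little more concrete, here is a minimal Python sketch. The function names are our own and purely illustrative, and we are assuming a very simple definition of a "word" (any run of letters, digits or underscores). It is not how any search engine actually does it, just one way to picture the calculation.

```python
import re


def phrase_frequency(phrase: str, text: str) -> float:
    """Occurrences of the phrase per word of text (case-insensitive)."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    if not words:
        return 0.0
    # Whole-word match so that "design" does not match "designer".
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, lowered)) / len(words)


def average_frequency(phrase: str, corpus: list[str]) -> float:
    """Mean frequency of the phrase across every document in the corpus."""
    if not corpus:
        return 0.0
    return sum(phrase_frequency(phrase, doc) for doc in corpus) / len(corpus)
```

Calling average_frequency with a phrase and a list of page texts gives the figure a single page can then be compared against.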
Using TF-IDF in your SEO efforts
Using TF-IDF in your SEO efforts can be a fantastic way of increasing the value of your content to your users by making sure you cover all the topics that users are generally looking for when searching for a specific term.
As we all know, if you increase the value to the user, search engines like Google and Bing are more likely to notice this and rank your content higher.
Using the TF-IDF formula directly can be quite an undertaking, but we'll explain it here before recommending some software that can do it for you.
The TF-IDF formula
First of all, let's split the term in half:
- TF (term frequency): This is simply how often a word is used within the document, so we take the number of times the term appears and divide it by the total number of words in the document.
(tf) = (term occurrences / document length in words)
- IDF (inverse document frequency): We now need to devalue the most common words, such as "the", "is" and "at", and scale up the rarer ones. For this we need to get a little more technical: we take the natural logarithm (log_e) of the total number of documents in the corpus divided by the number of documents in the corpus that contain the word.
(idf) = log_e(total documents / documents with word)
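As a rough sketch, the two halves of the formula could be written in Python like this. The function names are our own, purely for illustration, and we assume the natural logarithm (log_e) as written in the formula above. Other tools use slightly different variants, so treat this as one possible reading rather than the definitive version.

```python
import math


def term_frequency(term_count: int, total_words: int) -> float:
    """tf = occurrences of the term / total words in the document."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return term_count / total_words


def inverse_document_frequency(total_docs: int, docs_with_term: int) -> float:
    """idf = log_e(total documents / documents containing the term)."""
    if total_docs <= 0 or docs_with_term <= 0:
        raise ValueError("both document counts must be positive")
    return math.log(total_docs / docs_with_term)  # math.log is the natural log


def tf_idf(term_count: int, total_words: int,
           total_docs: int, docs_with_term: int) -> float:
    """The final score is simply tf multiplied by idf."""
    return (term_frequency(term_count, total_words)
            * inverse_document_frequency(total_docs, docs_with_term))
```

Note that a word appearing in every document gets an idf of log_e(1), which is 0, so very common words score nothing however often they are used.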
If we are optimising for the phrase "web design" and use it 3 times on our page and our page has 100 words (pretty small I know) then it's simply:
(tf) = (3 / 100) which is 0.03
Now, if we have a corpus of 1000 websites around the same topic (maybe the first 1000 websites that rank for that word in Google) and 50 of them mention that exact phrase, then it's:
(idf) = log_e(1000 / 50) which is approximately 3.00
So now we know what our score is for that word:
(tf) * (idf) = 0.03 * 3.00 = 0.09
Once we have our score, we also score the top 10 websites listed on Google for that word. If our score is well below the average of the sites on the first page, we are under-optimised for that word. Conversely, if it is well above, we are over-optimised for that word.
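The comparison against the first page can be pictured with a small Python sketch. The function name and the tolerance value are our own assumptions: "well below" has no fixed definition, so we have used a hypothetical band of 25% either side of the average, which you would want to tune.

```python
from statistics import mean


def optimisation_status(our_score: float,
                        competitor_scores: list[float],
                        tolerance: float = 0.25) -> str:
    """Compare our tf-idf score with the average of the first-page scores.

    tolerance is a hypothetical band (25% either side of the average)
    standing in for "well below" and "well above".
    """
    if not competitor_scores:
        raise ValueError("need at least one competitor score")
    average = mean(competitor_scores)
    if our_score < average * (1 - tolerance):
        return "under-optimised"
    if our_score > average * (1 + tolerance):
        return "over-optimised"
    return "in range"
```

For example, optimisation_status(0.02, [0.08, 0.10, 0.12]) reports the page as under-optimised, because 0.02 sits far below the first-page average of roughly 0.10.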
Now, this is where it becomes impossible (or at least extremely hard) for a human to continue.
Taking the number 1 listing on the first page for the term, we remove all common words such as "is", "this" and "that", then run a score on each and every remaining word and phrase. We then do the same for each of the other top 10 pages.
Once we have spent about a year doing that, we have the score for all the common topics we should also consider covering on our page!
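This is exactly the kind of repetitive work a script is good at. Below is a minimal Python sketch of the process under some loud assumptions: the stop-word list is a tiny illustrative one, only single words are scored (not phrases), term frequency is measured after the common words have been removed, and the function names are our own. It is not how the tools mentioned in this article work internally.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be far longer.
STOP_WORDS = {"is", "this", "that", "the", "at", "a", "an", "and", "of", "to", "in"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text, split it into words and drop the stop words."""
    words = re.findall(r"[a-z0-9#+]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def average_tf_idf(top_pages: list[str], corpus: list[str]) -> dict[str, float]:
    """Average tf-idf of every word across top_pages, with idf taken from corpus."""
    if not top_pages:
        return {}
    corpus_words = [set(tokenize(doc)) for doc in corpus]
    totals: Counter = Counter()
    for page in top_pages:
        words = tokenize(page)
        for term, count in Counter(words).items():
            docs_with_term = sum(term in doc for doc in corpus_words)
            if docs_with_term == 0:
                continue  # word never appears in the corpus, so idf is undefined
            idf = math.log(len(corpus) / docs_with_term)
            totals[term] += (count / len(words)) * idf
    # Pages that never use a word contribute a score of 0 to its average.
    return {term: total / len(top_pages) for term, total in totals.items()}
```

Sorting the returned dictionary by score, highest first, gives a rough list of the words the top pages lean on most heavily.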
Software to use for TF-IDF analysis
If you are a sane person, you will realise that doing this manually is not at all reasonable or even possible, so here are a few tools we here at OMY Digital use for this work.
- Our most used tool: Website Auditor in SEO Powersuite
- One we like the look of and are waiting to try: Text Tools
OMY Digital uses TF-IDF for our clients ;-)
When we perform SEO for our clients, methods like TF-IDF analysis are our go-to. We do not just optimise your metas and titles; our methods are always cutting edge and, more importantly, consistently successful.