Term frequency-inverse document frequency (tf-idf) is a well-known algorithm in the information retrieval domain that is used to assign a weight to a term. The algorithm takes a collection of documents as input and computes the frequency count of the words in each document. Additionally, it computes the inverse document frequency, which is used to determine the importance of words across the corpus. Words that occur often in a document but in few documents of the corpus receive a high weight, which shows the importance of the word to that document. Less useful words such as ‘the’, ‘a’, and ‘or’ often have the highest term counts, so the inverse document frequency acts as an offset that keeps the weight of such terms low. The term frequency of a term in a given document is the number of times the term appears in that document, normalized by the total number of term occurrences in the document. It is defined as (Karamcheti 2010, pp 29-31):

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of times the term $t_i$ occurs in the document $d_j$, and the denominator is the sum of the counts of all the terms that occur in the document $d_j$.
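The term-frequency definition above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a document has already been split into tokens; the function name `term_frequencies` and the sample sentence are hypothetical and do not come from the cited source.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each term to its count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # An empty document yields an empty dict, so no division by zero occurs.
    return {term: count / total for term, count in counts.items()}


# Hypothetical sample document, tokenized on whitespace.
sample_tokens = "the car is on the road".split()
tf = term_frequencies(sample_tokens)
print(tf["the"])  # 2 occurrences out of 6 tokens
```

Because every count is divided by the same total, the term frequencies of one document sum to 1, which makes documents of different lengths comparable.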
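The inverse document frequency of a given term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. Hence it is defined as (Karamcheti 2010, pp 29-31):

$$\mathrm{idf}_{i} = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

where $|D|$ is the total number of documents in the corpus and $|\{d : t_i \in d\}|$ is the number of documents in which the term $t_i$ appears (that is, in which its count $n_{i,j}$ is not equal to 0). If the term does not occur anywhere in the collection, this leads to a division by zero. Thus, it is common to use $1 + |\{d : t_i \in d\}|$ as the denominator. The weight, or tf*idf value, of a term is always greater than or equal to zero when the unadjusted denominator is used, because the quotient inside the logarithm is then at least 1.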
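The following sketch illustrates the inverse document frequency and the adjusted denominator discussed above. It assumes the natural logarithm (the text does not fix a base) and represents each document as a set of tokens; the function name, the `smooth` flag, and the tiny corpus are hypothetical choices made only for this illustration.

```python
import math


def inverse_document_frequency(term, documents, smooth=False):
    """Return log(|D| / df), or log(|D| / (1 + df)) when smooth is True."""
    # df: number of documents that contain the term at least once.
    df = sum(1 for doc in documents if term in doc)
    if smooth:
        df += 1
    if df == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log(len(documents) / df)


# Hypothetical corpus: each document is a set of lower-cased tokens.
documents = [{"bmw", "is", "working"}, {"bosch", "is", "working"}]

print(inverse_document_frequency("is", documents))    # in every document: log(1)
print(inverse_document_frequency("bmw", documents))   # in one of two documents
print(inverse_document_frequency("audi", documents, smooth=True))  # unseen term
```

Note that with the adjusted denominator a term that occurs in every document gets a slightly negative value, since the quotient drops below 1; this is one reason other smoothing variants are also used in practice.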
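Equation 4.3 gives the simplest mathematical overview of tf-idf:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i} \tag{4.3}$$

This equation consists of two parts. The first part, $\mathrm{tf}_{i,j}$, is the term frequency of the term $t_i$ in the document $d_j$, and the second part, $\mathrm{idf}_{i}$, is the logarithm of the inverse document frequency. The inverse document frequency is acquired by dividing the total number of documents by the number of documents in which the term occurs. Once both parts have been computed, the tf-idf weight is obtained by multiplying them (an ordinary product, not a cross product); collecting the weights of all terms gives the tf-idf vector of a document.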
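As a worked illustration of the two parts and their product, consider a purely hypothetical case: a term that occurs 3 times in a document of 100 terms and appears in 10 of the 1000 documents of a corpus. The numbers are invented for the example, and the natural logarithm is assumed.

```latex
\begin{align*}
\mathrm{tf}    &= \frac{3}{100} = 0.03 \\
\mathrm{idf}   &= \ln\frac{1000}{10} = \ln 100 \approx 4.605 \\
\mathrm{tfidf} &= 0.03 \times 4.605 \approx 0.138
\end{align*}
```

If the same term appeared in all 1000 documents instead, the second part would be $\ln 1 = 0$, and the weight would be zero regardless of how often the term occurs in the document.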
Tf-idf takes it one step further. It considers the term frequency but also takes the specificity of a token into account. This combination is called term frequency-inverse document frequency. For example, the token ‘the’ occurs with high frequency in almost all texts, so a tf-idf vectorizer assigns it a low value. A word that occurs often in a few texts but rarely in the others receives a higher value.
Consider the example corpus taken from table 7. It contains two documents, and each document contains words that are common to both documents, such as ‘is’, ‘working’, ‘on’, and ‘industry’, as well as words that are not shared, such as ‘BMW’, ‘automotive’, ‘Bosch’, and ‘electronics’. An example tf-idf matrix for the given corpus is shown in table 9. The table shows clearly that only the words that do not occur in both documents receive a nonzero tf-idf weight. By looking at the last column of the table, we immediately realize that the corresponding document is about ‘Bosch’ and ‘electronics’.
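The behaviour described for this example can be reproduced with a small script. The two sentences below are a hypothetical reconstruction assembled from the words quoted in the text, not the actual content of table 7. The sketch also assumes the unadjusted idf with a natural logarithm, so the numbers need not match table 9.

```python
import math

# Hypothetical two-document corpus built from the words quoted in the text.
corpus = {
    "doc1": "BMW is working on automotive industry".split(),
    "doc2": "Bosch is working on electronics industry".split(),
}
all_docs = list(corpus.values())
vocabulary = sorted({term for tokens in all_docs for term in tokens})


def tf_idf(term, tokens, docs):
    """Unadjusted tf-idf: (count / document length) * log(|D| / df)."""
    tf = tokens.count(term) / len(tokens)
    # df is at least 1 here because every term comes from the vocabulary.
    df = sum(1 for doc in docs if term in doc)
    return tf * math.log(len(docs) / df)


# Rows are terms, columns are documents.
matrix = {
    term: {name: tf_idf(term, tokens, all_docs) for name, tokens in corpus.items()}
    for term in vocabulary
}

for term, row in matrix.items():
    print(f"{term:12s}" + "  ".join(f"{row[name]:.3f}" for name in corpus))
```

In this sketch the words shared by both documents get a weight of exactly zero, because their quotient inside the logarithm is 1, while ‘Bosch’ and ‘electronics’ carry the only nonzero weights in the second column.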
In classification, these weights represent the importance of each word: the higher the weight, the more important the word. Tf-idf is simple and fast, and it often produces good results in practice, but it has certain limitations when it comes to more challenging tasks. In particular, it does not recognize synonyms or plurals. As pointed out by Ramos (2003), tf-idf makes no connection between a word and its variants, so ‘car’ and ‘cars’, for example, are treated as separate, unrelated terms.