Wednesday, January 29, 2014

A simple Java class for tf*idf scoring



This post complements and extends an earlier post.


Our goal is to calculate the TF and IDF of every term (word) in a corpus (a set of text documents). The documents were indexed with Lucene 4.3; migrating to the latest version of Lucene (4.6 at the time of writing) should be fairly painless for now. See the official migration guide for details.
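To make the two quantities concrete before turning to Lucene, here is a minimal plain-Java sketch that counts raw term frequencies and document frequencies on a toy corpus. The class and method names (RawTfDfSketch, termFrequencies, documentFrequencies) are made up for illustration, and the lowercase, whitespace-only tokenization is an assumption; a Lucene analyzer would normally tokenize differently.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: raw counts on an in-memory corpus.
class RawTfDfSketch {

    /** Counts how often each whitespace-separated token occurs in one document. */
    static Map<String, Integer> termFrequencies(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : document.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Counts in how many documents each term occurs at least once. */
    static Map<String, Integer> documentFrequencies(List<String> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (String document : corpus) {
            // keySet() holds each distinct term of the document exactly once
            for (String term : termFrequencies(document).keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("the cat sat", "the cat and the dog", "a bird");
        System.out.println(termFrequencies(corpus.get(1)).get("the")); // prints 2
        System.out.println(documentFrequencies(corpus).get("the"));    // prints 2
    }
}
```

In the Lucene-based class, these two counts correspond to docsEnum.freq() and termEnum.docFreq(), which are read from the index instead of being recomputed from the text.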

The Java class below relies on the following imports:
  
import java.io.IOException;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.TFIDFSimilarity;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

This class calculates the tf*idf score of a given term in every document that contains it, using the Lucene 4.x API:

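public class Tf_Idf
{
  static float tf = 1;
  static float idf = 0;
  private float tfidf_score;

  static float[] tfidf = null;

  /** Prints the tf*idf score of the given term for every live document that contains it. */
  public void scoreCalculator(IndexReader reader, String field, String term) throws IOException
  {
    /** GET TERM FREQUENCY & IDF **/
    TFIDFSimilarity tfidfSIM = new DefaultSimilarity();
    // liveDocs marks the documents that have not been deleted
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    // Note: getTerms returns null if the field does not exist in the index
    TermsEnum termEnum = MultiFields.getTerms(reader, field).iterator(null);
    BytesRef bytesRef;
    while ((bytesRef = termEnum.next()) != null)
    {
      // Compare string contents with equals(); == only compares object references
      if (bytesRef.utf8ToString().trim().equals(term.trim()))
      {
        if (termEnum.seekExact(bytesRef, true))
        {
          // idf depends only on the term, so it is computed once
          idf = tfidfSIM.idf(termEnum.docFreq(), reader.numDocs());
          DocsEnum docsEnum = termEnum.docs(liveDocs, null);
          if (docsEnum != null)
          {
            int doc;
            while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
            {
              // tf depends on how often the term occurs in the current document
              tf = tfidfSIM.tf(docsEnum.freq());
              tfidf_score = tf * idf;
              System.out.println(" -tfidf_score- " + tfidf_score);
            }
          }
        }
        break; // each term appears only once in the enumeration
      }
    }
  }

}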
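For readers who want to check the printed scores by hand, here is a small plain-Java sketch of the arithmetic. It assumes that DefaultSimilarity in Lucene 4.x computes tf as the square root of the in-document frequency and idf as 1 + ln(numDocs / (docFreq + 1)); treat that as my reading of the Lucene documentation and verify it against the version you use. The class name DefaultSimilarityFormulaSketch and its methods are hypothetical and do not call Lucene at all.

```java
// Hypothetical stand-in that mirrors the assumed DefaultSimilarity formulas.
class DefaultSimilarityFormulaSketch {

    /** Assumed tf: square root of the term's frequency in the document. */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Assumed idf: 1 + ln(numDocs / (docFreq + 1)). */
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Product of the two factors, as in the tfidf_score of the class above. */
    static float tfIdf(float freq, long docFreq, long numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a document and present in 9 of 10 documents:
        // tf = sqrt(4) = 2, idf = 1 + ln(10 / 10) = 1, so the score is 2.0
        System.out.println(tfIdf(4f, 9, 10)); // prints 2.0
    }
}
```

Note that this score is only the tf*idf product; it leaves out the other factors (such as norms and boosts) that Lucene applies when ranking search results.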
