Wednesday, January 29, 2014

A simple Java class for tf*idf scoring



This post can be read as a complement to, and an extension of, an earlier post.


Our goal is to calculate the TF (term frequency) and IDF (inverse document frequency) of every term (word) in a corpus (a set of text documents). The documents were indexed with Lucene 4.3, but migrating to the latest version of Lucene (4.6 at the time of writing) is painless for the time being. Check the official migration guide to the latest version.
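To make the two quantities concrete before bringing in Lucene, here is a minimal plain-Java sketch that counts them on a tiny in-memory corpus. The class name CorpusStatsSketch, its methods, and the sample documents are hypothetical and exist only for illustration; Lucene reads the same statistics from its index rather than recounting them.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: raw counts on an in-memory corpus.
class CorpusStatsSketch {

    // Term frequency: how many times `term` occurs in one tokenized document.
    static int termFreq(List<String> doc, String term) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Document frequency: for each term, the number of documents containing it.
    static Map<String, Integer> docFreqs(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            // A set ensures each term is counted once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("lucene", "index", "lucene"),
            Arrays.asList("index", "search"),
            Arrays.asList("search", "score"));
        System.out.println(termFreq(corpus.get(0), "lucene")); // prints 2
        System.out.println(docFreqs(corpus).get("index"));     // prints 2
    }
}
```

The IDF is then derived from the document frequency and the total number of documents; the exact formula depends on the similarity implementation in use.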

The Java class provided below relies on these imports:
  
import java.io.IOException;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.TFIDFSimilarity;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

This class calculates the tf*idf score of a given term in every document that contains it, using the Lucene API:

public class Tf_Idf 
{
  static float tf = 1;
  static float idf = 0;
  private float tfidf_score;
    
  static float[] tfidf = null;

  
 public void scoreCalculator(IndexReader reader,String field,String term) throws IOException 
     { 
         /** GET TERM FREQUENCY & IDF **/ 
         TFIDFSimilarity tfidfSIM = new DefaultSimilarity();
         Bits liveDocs = MultiFields.getLiveDocs(reader);
         TermsEnum termEnum = MultiFields.getTerms(reader, field).iterator(null);
         BytesRef bytesRef;
         while ((bytesRef = termEnum.next()) != null) 
         {           
           // Compare string contents with equals(); == only compares object references.
           if(bytesRef.utf8ToString().trim().equals(term.trim()))
           {                  
              if (termEnum.seekExact(bytesRef, true)) 
              {
                 idf = tfidfSIM.idf(termEnum.docFreq(), reader.numDocs());
                 DocsEnum docsEnum = termEnum.docs(liveDocs, null);
                 if (docsEnum != null) 
                 {
                    int doc; 
                    while((doc = docsEnum.nextDoc())!=DocIdSetIterator.NO_MORE_DOCS) 
                     {
                         tf = tfidfSIM.tf(docsEnum.freq());
                         tfidf_score = tf*idf; 
                         System.out.println(" -tfidf_score- " + tfidf_score);
                     }
                 } 
             } 
           }
        } 
     }
  
}
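
For readers who want to check the printed scores by hand, the following plain-Java sketch reproduces the arithmetic without any Lucene dependency. It assumes the classic formulas that, to my knowledge, DefaultSimilarity uses in the Lucene 4.x line (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))); verify them against the Javadoc of your version. The class name TfIdfFormulaSketch is hypothetical.

```java
// Plain-Java sketch of the assumed tf and idf formulas; no Lucene dependency.
class TfIdfFormulaSketch {

    // tf = sqrt(freq), where freq is the number of occurrences in one document.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf = 1 + ln(numDocs / (docFreq + 1))
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // The product printed as tfidf_score in the class above.
    static float tfIdf(float freq, long docFreq, long numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a document and present in 9 of 10 documents:
        // tf = 2, idf = 1 + ln(10 / 10) = 1, so the score is 2.0.
        System.out.println(tfIdf(4f, 9, 10)); // prints 2.0
    }
}
```

Note that this is the raw tf*idf product only; Lucene's full query-time score includes further factors such as norms and boosts, which are not modeled here.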
