Tuesday, September 24, 2013

Implementation of Cosine Similarity [JAVA and Python Example]





Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as:



   This metric is frequently used when trying to determine similarity between two documents. In this similarity metric, the attributes (or words, in the case of the documents) is used as a vector to find the normalized dot product of the two documents. By determining the cosine similarity, the user is effectively trying to find cosine of the angle between the two objects. For cosine similarities resulting in a value of 0, the documents do not share any attributes (or words) because the angle between the objects is 90 degrees.

   Below there are two possible implementations written in Java and in Python which you might find handy if you want to develop your own projects.


Java implementation:


public class CosineSimilarity 
{
  /**
     * Method to calculate cosine similarity between two documents.
     * @param docVector1 : document vector 1 (a)
     * @param docVector2 : document vector 2 (b)
     * @return 
     */
    public double cosineSimilarity(double[] docVector1, double[] docVector2) 
    {
        double dotProduct = 0.0;
        double magnitude1 = 0.0;
        double magnitude2 = 0.0;
        double cosineSimilarity = 0.0;

        for (int i = 0; i < docVector1.length; i++) //docVector1 and docVector2 must be of same length
        {
            dotProduct += docVector1[i] * docVector2[i];  //a.b
            magnitude1 += Math.pow(docVector1[i], 2);  //(a^2)
            magnitude2 += Math.pow(docVector2[i], 2); //(b^2)
        }

        magnitude1 = Math.sqrt(magnitude1);//sqrt(a^2)
        magnitude2 = Math.sqrt(magnitude2);//sqrt(b^2)

        if (magnitude1 != 0.0 | magnitude2 != 0.0)
        {
            cosineSimilarity = dotProduct / (magnitude1 * magnitude2);
        } 
        else 
        {
            return 0.0;
        }
        return cosineSimilarity;
    }
}
    *In addition you should know how to implement the tf/idf (term frequency-inverse document frequency) for every term(word) of the document , since those are the values of the attributes A and B. You can find a mathematical explanation of tf/idf with an example written in Java here!

Python implementation:



# Input: 2 vectors
# Output: the cosine similarity
# !!! Untested !!!
def cosine_similarity(vector1,vector2):
  # Calculate numerator of cosine similarity
  dot = [vector1[i] * vector2[i] for i in range(vector1)]
  
  # Normalize the first vector
  sum_vector1 = 0.0
  sum_vector1 += sum_vector1 + (vector1[i]*vector1[i] for i in range(vector1))
  norm_vector1 = sqrt(sum_vector1)
  
  # Normalize the second vector
  sum_vector2 = 0.0
  sum_vector2 += sum_vector2 + (vector2[i]*vector2[i] for i in range(vector2))
  norm_vector2 = sqrt(sum_vector2)
  
  return (dot/(norm_vector1*norm_vector2))

No comments:

Post a Comment