## Tuesday, September 24, 2013

### Implementation of TF/IDF (Term Frequency-Inverse Document Frequency) in JAVA

Tf–Idf is the product of two statistics, term frequency and inverse document frequency. Various ways for determining the exact values of both statistics exist. In the case of the term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw frequency of t by f(t,d), then the simple Tf scheme is tf(t,d) = f(t,d).

The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient .

Where:
$|D|$  is the cardinality, or the total number of documents in the corpus and respectively
$|\{d \in D: t \in d\}|$ is the number of documents where the term appears.

Then tf–idf is calculated as:

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.

NOTE: For a deeper dive into theory read this.

Practically..

A possible implementation (written in Java) for calculating Tf/Idf of the terms of a document is given below:

public class Tf_Idf
{
/**
* Calculated the tf of term termToCheck
* @param totalterms : Array of all the words under processing document
* @param termToCheck : term of which tf is to be calculated.
* @return tf(term frequency) of term termToCheck
*/
public double tfCalculator(List totalterms, String termToCheck)
{
double count = 0;
for (String s : totalterms)
{
if (s.equalsIgnoreCase(termToCheck))
count++;
}
return count / totalterms.length;
}

/**
* Calculated idf of term termToCheck
* @param allTerms : all the terms of all the documents
* @param termToCheck
* @return idf(inverse document frequency) score
*/
public double idfCalculator(List<> allTerms, String termToCheck)
{
double count = 0;
for (String[] ss : allTerms)
{
for (String s : ss)
{
if (s.equalsIgnoreCase(termToCheck))
{
count++;
break;
}
}
}
return Math.log(allTerms.size() / count);
}

}

*If you want to know more about tf/idf and its applications check out the Cosine similarity example here!

### Implementation of Cosine Similarity [JAVA and Python Example]

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as:

This metric is frequently used when trying to determine similarity between two documents. In this similarity metric, the attributes (or words, in the case of the documents) is used as a vector to find the normalized dot product of the two documents. By determining the cosine similarity, the user is effectively trying to find cosine of the angle between the two objects. For cosine similarities resulting in a value of 0, the documents do not share any attributes (or words) because the angle between the objects is 90 degrees.

Below there are two possible implementations written in Java and in Python which you might find handy if you want to develop your own projects.

Java implementation:

public class CosineSimilarity
{
/**
* Method to calculate cosine similarity between two documents.
* @param docVector1 : document vector 1 (a)
* @param docVector2 : document vector 2 (b)
* @return
*/
public double cosineSimilarity(double[] docVector1, double[] docVector2)
{
double dotProduct = 0.0;
double magnitude1 = 0.0;
double magnitude2 = 0.0;
double cosineSimilarity = 0.0;

for (int i = 0; i < docVector1.length; i++) //docVector1 and docVector2 must be of same length
{
dotProduct += docVector1[i] * docVector2[i];  //a.b
magnitude1 += Math.pow(docVector1[i], 2);  //(a^2)
magnitude2 += Math.pow(docVector2[i], 2); //(b^2)
}

magnitude1 = Math.sqrt(magnitude1);//sqrt(a^2)
magnitude2 = Math.sqrt(magnitude2);//sqrt(b^2)

if (magnitude1 != 0.0 | magnitude2 != 0.0)
{
cosineSimilarity = dotProduct / (magnitude1 * magnitude2);
}
else
{
return 0.0;
}
return cosineSimilarity;
}
}

*In addition you should know how to implement the tf/idf (term frequency-inverse document frequency) for every term(word) of the document , since those are the values of the attributes A and B. You can find a mathematical explanation of tf/idf with an example written in Java here!

Python implementation:

# Input: 2 vectors
# Output: the cosine similarity
# !!! Untested !!!
def cosine_similarity(vector1,vector2):
# Calculate numerator of cosine similarity
dot = [vector1[i] * vector2[i] for i in range(vector1)]

# Normalize the first vector
sum_vector1 = 0.0
sum_vector1 += sum_vector1 + (vector1[i]*vector1[i] for i in range(vector1))
norm_vector1 = sqrt(sum_vector1)

# Normalize the second vector
sum_vector2 = 0.0
sum_vector2 += sum_vector2 + (vector2[i]*vector2[i] for i in range(vector2))
norm_vector2 = sqrt(sum_vector2)

return (dot/(norm_vector1*norm_vector2))


## Monday, September 23, 2013

### The importance of being Semantic

“Aristotle teaches us what names and words designate and that from one side there is mental representations (Noemata) and from the other side the process of naming and designation is being realized by the means of a designator (Subject) and an object and that one should not add any kind of intermediate element between the thought and the object.”

In other words, it is inherent in people to conceptualize the meaning of a symbol and the object that it refers to. This process relies on subjective reasoning(on some extend) and thus may not be universal, but nevertheless is common ground among all thinking animals. On a higher level logic-thinking animals (like humans) may not only understand the meaning of a term but may also "rank" it when confronting it with other terms. This demonstrates the capability of logic-thinking animals to reason and deduce facts implicitly. On the other hand computers may have been evolved in a logarithmic manner and may have strong computing capabilities but still lack of certain aspects that would classify them as intelligent thinkers. Since computers can religiously follow only rules, the Aristotelian Triangle is of no use in this case.

Semantics do not mean anything

Lazy but big steps towards a semantic universe have led us to a series of techniques and theories that can potentially give us the possibility to build an actual thinking machine. Computational linguistics and natural language processing, collaborative filtering, models developed using data mining and machine learning algorithms, AI, processes of filtering for information or patterns, methods of making automatic predictions and decisions etc.. All of the above techniques are directly (or indirectly) trying to resolve the problem of semantics.

Terms, Words, Phrases, Texts, CORPUS..

The ability of correlating terms, or words, or phrases, or even whole texts is something that can be taught to computers in an efficient(more or less) way. Given a collection of documents of the same semantic context(corpus) a practical way to do this is by giving a semantic "weight" at each term of every single document. By measuring the "weights" of each term and confronting them, we can achieve a certain classification that can be further exploited in order to measure the semantic "distance" between terms. The weighting factor is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is called term frequency–inverse document frequency (tf–idf), which is a value that increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. This may look like a paradox, but actually it helps to control for the fact that some words are generally more common than others which means that they are less significant and can not be taken into consideration.

PRACTICAL  NOTE*
• Term Frequency: Suppose for a document there are overall 5000 words and a word Term-Frequency occurs 5 times. Then , mathematically, its Term Frequency, TF = 5/5000 =0.001.
• Inverse Document Frequency: Suppose one bought 1Q84 series, all series. Suppose the there are 9000000000000 total words and a word "senshei" comes 70000 times in all of the series. Then, mathematically, its Inverse-Document Frequency ,IDF = log(9000000000000/70000) = .......(i am sure it gives as a number.)
And finally,   TF/IDF = TF * IDF;

"To get to know each other.."

The evaluation of the semantic proximity is done using the Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. Each term is notionally assigned a different dimension and a document is characterized by a vector where the value of each dimension corresponds to the tf–idf of the term. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter. The formula is given below and in our case A and B represent a document containing the respective tf-idf of the terms.

A more mathematical  explanation of the Cosine similarity and an example written in Java and Python can be found here.

-Just imagine a future in which machines will be able to conceptualize the meaning of death.

## Wednesday, September 18, 2013

### Extraction of links from an HTML document + Example (Jsoup 1.7.2)

This is an example of how we can parse an HTML document and extract all the (valid) links contained in the BODY element. The "tool" used for the manipulation and extraction is Jsoup(version 1.7.2).

The power of this API lies within the simplicity of the implementation. As a matter of fact we can efficiently extract any information in just a few lines of code. In primis an HTML File is read and parsed.

File htmlFile = new File("../workspace/HtmlParserJsoup/sample.html");
Document doc = Jsoup.parse(htmlFile, "UTF-8");

As long as you pass in a non-null string, the output is a Document containing (at least) a head and a body element.(watch out .. the reference is @import org.jsoup.nodes.Document; )

In order to extract the http links contained in the body element we initially find all the elements with the a tag, and then a loop is performed in order to check the validity of the links. The logic is pretty straight forward such as the flow of the program.

/*** GET ALL LINKS OF BODY ***/
{
}

At the end you should have nothing else but valid http links ..

### Parse HTML with JSOUP API

For this purpose the jsoup api is being used, which is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

This library implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

The core functionality of this api is displayed in the next few paragraphs along with some simple examples. Generally jsoup can.. :
• Scrape and parse HTML from a URL, file, or string
Example:

File htmlFile = new File("../workspace/HtmlParserJsoup/GET_RESPONSE.html");
Document doc = Jsoup.parse(htmlFile, "UTF-8");


• Find and extract data, using DOM traversal or CSS selectors
Example :

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src\$=.png]");
// img with src ending .png
Elements resultLinks = doc.select("h3.r > a"); // direct a after h3


• Manipulate the HTML elements, attributes, and text
Example:
Element div = doc.select("div").first(); //

div.html("lorem ipsum");
// lorem ipsum

div.prepend("First");
div.append("Last");
// now: First lorem ipsum Last

Element span = doc.select("span").first(); // One
span.wrap("");
// now:One



•  Clean user-submitted content against a safe white-list, to prevent XSS attacks
Example:
String unsafe ="Link";
String safe = Jsoup.clean(unsafe, Whitelist.basic());


•  output tidy HTML

UNAGI:

An example of HTML parsing can be found here.
Enjoy!   :-)

## Monday, September 16, 2013

### Google search engine response(JSON) structure

This document is intended for developers who want to write applications that can interact with the JSON/Atom Custom Search API. We will analyze the structure of the response that we received after an http GET request to google search engine.

Response data:
{
"kind": "customsearch#search",
"url": {
"type": "application/json",
"queries": {
"nextPage": [
{
"title": "Google Custom Search - stars",
"totalResults": "309000",
"searchTerms": "stars",
"count": 10,
"startIndex": 11,
"inputEncoding": "utf8",
"outputEncoding": "utf8",
"safe": "off",
"cx": "002551901808508802477:39t74qehyw0"
}
],
"request": [
{
"title": "Google Custom Search - stars",
"totalResults": "309000",
"searchTerms": "stars",
"count": 10,
"startIndex": 1,
"inputEncoding": "utf8",
"outputEncoding": "utf8",
"safe": "off",
"cx": "002551901808508802477:39t74qehyw0"
}
]
},
"context": {
"title": "Customized.Searching"
},
"searchInformation": {
"searchTime": 0.29926,
"formattedSearchTime": "0.30",
"totalResults": "309000",
"formattedTotalResults": "309,000"
},
"items": [
{
"kind": "customsearch#result",
"title": "Formation of the First Stars",
"htmlTitle": "Formation of the First \u003cb\u003eStars\u003c/b\u003e",
"snippet": "May 22, 2013 ... Abstract: Understanding the formation of the...",
"htmlSnippet": "May 22, 2013 \u003cb\u003e...Abstract: Unde...",
"cacheId": "CbjG-txlc30J",
"formattedUrl": "arxiv.org/abs/1305.5178",
"htmlFormattedUrl": "arxiv.org/abs/1305.5178",
"pagemap": {
"metatags": [
{
"citation_title": "Formation of the First Stars",
"citation_author": "Bromm, Volker",
"citation_date": "2013/05/22",
"citation_online_date": "2013/05/22",
"citation_pdf_url": "http://arxiv.org/pdf/1305.5178",
"citation_arxiv_id": "1305.5178"
}
]
}
},
.
.
.
.
{
"kind": "customsearch#result",
"title": "[1309.0096] Quark deconfinement transition in neutron stars...",
"htmlTitle": "[1309.0096] Quark deconfinement transition in neutron 0096",
"snippet": "Aug 31, 2013 ... Abstract: A phaseof stronginteracting marwi...",
"htmlSnippet": "Aug 31, 2013 \u003cb\u003e...\u003c/b\u003e Abstract: Aphase",
"cacheId": "_eXZTeanAq8J",
"formattedUrl": "arxiv.org/abs/1309.0096",
"htmlFormattedUrl": "arxiv.org/abs/1309.0096",
"pagemap": {
"metatags": [
{
"citation_title": "Quark deconfinement transition in neutr",
"citation_author": "Logoteta, Domenico",
"citation_date": "2013/08/31",
"citation_online_date": "2013/08/31",
"citation_pdf_url": "http://arxiv.org/pdf/1309.0096",
"citation_arxiv_id": "1309.0096"
}
]
}
}
]
}

As shown in the sample output above, the response data includes three main classes of properties:

• Search results
The main classes of data are described below:

The search metadata includes the url property, which has information about the OpenSearch template used for the results returned in this request. It also includes a queries property, which is an array of objects describing the characteristics of possible searches. The name of each object in the array is either the name of an OpenSearch query role or one of the two custom roles defined by this API: previousPage and nextPage .

Each query role object is itself an array, though usually the array contains a single element. Possible query role objects include:

request: Metadata describing the query for the current set of results. This role is always present in the response. It is always an array with just one element.
nextPage: Metadata describing the query to use for the next page of results. This role is not present if the current results are the last page. Note: This API returns up to the first 100 results only. When present, it is always a array with just one element.
previousPage: Metadata describing the query to use for the previous page of results. Not present if the current results are the first page. When present, it is always a array with just one element.

Note: The totalResults property in the objects above identifies the estimated total number of results for the search, which may not be accurate.

The context property has metadata describing the search engine that performed the search query. It includes the name of the search engine, and any facet objects it provides for refining a search.

Search results

The items array contains the actual search results. The search results include the URL, title and text snippets that describe the result. In addition, they can contain pagemap information, if applicable. If the search results include a promotions property, it contains a set of promotions.

### Convert Java Object From JSON (Gson API)

In this article, we show you an example of how to use Gson, JSON library, to convert json to object. Gson API is easy to learn and implement, what we need to know are the following two methods:

•  toJson() – Convert Java object to JSON format
•  fromJson() – Convert JSON into Java object

We suppose that an http Get request is made to Google search engine. The response from google is in json format and a part of this response is presented here:

.
.
.
"items": [
{
"kind": "customsearch#result",
"title": "Formation of the First Stars",
"htmlTitle": "Formation of the First \u003cb\",
"snippet": "May 22, 2013 ... Abstract: Understanding the formation of ...",
"htmlSnippet": "May 22, 2013 \u003cb\u003e...",
"cacheId": "CbjG-txlc30J",
"formattedUrl": "arxiv.org/abs/1305.5178",
"htmlFormattedUrl": "arxiv.org/abs/1305.5178",
"pagemap": {
"metatags": [
{
"citation_title": "Formation of the First Stars",
"citation_author": "Bromm, Volker",
"citation_date": "2013/05/22",
"citation_online_date": "2013/05/22",
"citation_pdf_url": "http://arxiv.org/pdf/1305.5178",
"citation_arxiv_id": "1305.5178"
}
]
}
},
{
"kind": "customsearch#result",
"title": "Fast radio bursts: the last sign of supramassive neutron stars",
"htmlTitle": "Fast radio bursts: the last sign of supramassive neutron",
"snippet": "Jul 4, 2013 ... Abstract: Several fast radio bursts , ...",
"htmlSnippet": "Jul 4, 2013 \u003cb\u003e...\u003c/b\u003e Abstract:....",
"cacheId": "kcClCUGrI9IJ",
"formattedUrl": "arxiv.org/abs/1307.1409",
"htmlFormattedUrl": "arxiv.org/abs/1307.1409",
"pagemap": {
"metatags": [
{
"citation_title": "Fast radio bursts: the last sign of supramassive..",
"citation_author": "Falcke, Heino",
"citation_date": "2013/07/04",
"citation_online_date": "2013/07/04",
"citation_pdf_url": "http://arxiv.org/pdf/1307.1409",
"citation_arxiv_id": "1307.1409"
}
]
}
},……]
.
.
.

Generally the response data includes three main classes of properties:

• Search results

As shown in the sample output above The items array contains the actual search results. The search results include the URL, title and text snippets that describe the result.
*In addition, they can contain pagemap information, if applicable. The metatags array contains the available metadata that come with the search result. A sample of metadata is given as an example below:

JSON METATAGS SAMPLE


.
.
citation_language
prism.publicationname: "Icarus"
prism.issn: "0019-1035"
prism.volume: "216"
prism.startingpage: "136"
prism.endingpage: "140"
dc.language: "en"
dc.identifier: "doi:10.1016/j.icarus.2011.08.022"
dc.date: "2011-11"
dc.source: "Icarus, Volume 216, Issue 1, p. 136-140."
eprints.creators_name: "Barnes, Jason W."
eprints.creators_id: "Barnes-J-W"
eprints.type: "article"
eprints.datestamp: "2011-12-19 19:04:30"
eprints.lastmod: "2011-12-19 19:04:30"
eprints.title: "Organic sedimentary deposits in Titan’s dry lakebeds: Probable evaporite"
eprints.ispublished: "pub"
eprints.full_text_status: "none"
eprints.keywords: "Titan"
eprints.date: "2011-11"
eprints.date_type: "published"
eprints.publication: "Icarus"
eprints.volume: "216"
eprints.number: "1"
eprints.publisher: "Elsevier"
.
.


JAVA CLASS:

The java class reads data from response.json ,then “maps” them back to a java object and displays it.                                                                                                                       *Note:Dependencies are omitted

public class CustomSearch {

static private String title= new String();
static private String htmlTitle= new String();
static private String link= new String();
static private String formattedUrl= new String();
static private String htmlFormattedUrl= new String();
static private List metatagList= new ArrayList();

public static void main(String[] args) throws IOException
{

File jsonFile = new File("C:\\..\\response.json");
Gson gson = new Gson();
Response response = gson.fromJson(new FileReader(jsonFile), Response.class);

for (Item item : response.items)
{
System.out.println("[DEBUG] title: " + item.title);
System.out.println("[DEBUG] htmlTitle: " + item.htmlTitle);
System.out.println("[DEBUG] formattedUrl: " + item.formattedUrl);
System.out.println("[DEBUG] htmlFormattedUrl: " + item.htmlFormattedUrl);

// Assign values
title = item.title;
htmlTitle = item.htmlTitle;
formattedUrl = item.formattedUrl;
htmlFormattedUrl = item.htmlFormattedUrl;

Pagemap pagemap = item.pagemap;
Iterator iterMetatags ;

if(pagemap != null)
{
iterMetatags = pagemap.metatags.listIterator();

while(iterMetatags.hasNext())
{
Metatag element = iterMetatags.next();

System.out.println("[DEBUG] Metatag citation_doi :" + element.citation_doi);
System.out.println("[DEBUG] Metatag citation_arxiv_id :" + element.citation_arxiv_id);
System.out.println("[DEBUG] Metatag citation_author :" + element.citation_author);
System.out.println("[DEBUG] Metatag citation_date :" + element.citation_date);
System.out.println("[DEBUG] Metatag citation_title :" + element.citation_title);
System.out.println("[DEBUG] Metatag citation_pdf_url :" + element.citation_pdf_url);

System.out.println("[DEBUG] Metatag title :" + element.title);
System.out.println("[DEBUG] Metatag creator :" + element.creator);
System.out.println("[DEBUG] Metatag creationdate :" + element.creationdate);
System.out.println("[DEBUG] Metatag moddate :" + element.moddate);
System.out.println("[DEBUG] Metatag producer :" + element.producer);
System.out.println("[DEBUG] Metatag company :" + element.company);

System.out.println("[DEBUG] Metatag dc_title :" + element.dc_title);
System.out.println("[DEBUG] Metatag dc_description :" + element.dc_description);
System.out.println("[DEBUG] Metatag dc_creator :" + element.dc_creator)

}

}

}

}

//Mapping Json response to Java Object

class Response
{
List items;
}

class Item
{
String title;
String htmlTitle;
String formattedUrl;
String htmlFormattedUrl;
Pagemap pagemap;
}
class Pagemap
{
List metatags;
}
class Metatag
{
String citation_doi;
String citation_title;
String citation_author;
String citation_date;
String citation_pdf_url;
String citation_arxiv_id;

String title;
String creator;
String creationdate;
String moddate;
String producer;
String company;

@SerializedName("dc.title")
String dc_title;
@SerializedName("dc.description")
String dc_description;
@SerializedName("dc.creator")
String dc_creator;

}

}

Note that the annotation @SerializedName() is a field renaming technique that it is used to rename field names which are not java compatible.(e.g.:dc.creator)
Custom JSON Serialization and Deserialization  using GSON API

Sometimes default representation is not what you want. This is often the case when dealing with library classes (DateTime, etc). Gson allows you to register your own custom serializers and deserializers. This is done by defining two parts:

• Json Serialiers: Need to define custom serialization for an object
• Json Deserializers: Needed to define custom deserialization for a type Instance Creators: Not needed if no-args constructor is available or a deserializer is registered


GsonBuilder gson = new GsonBuilder();


registerTypeAdapter call checks if the type adapter implements more than one of these interfaces and register it for all of them. Writing a Serializer Here is an example of how to write a custom serializer for JodaTime DateTime class.


private class DateTimeSerializer implements JsonSerializer{
public JsonElement serialize(DateTime src, Type typeOfSrc, JsonSerializationContext context) {
return new JsonPrimitive(src.toString());
}
}


Gson calls toJson() when it runs into a DateTime object during serialization. Writing a Deserializer Here is an example of how to write a custom deserializer for JodaTime

DateTime class

private class DateTimeDeserializer implements JsonDeserializer{
public DateTime deserialize(JsonElement json, Type typeOfT, JsonDeserializationContext context)
throws JsonParseException {
return new DateTime(json.getAsJsonPrimitive().getAsString());
}
}

Gson calls fromJson() when it needs to deserialize a JSON string fragment into a DateTime object Finer points with Serializers and Deserializers.
Often you want to register a single handler for all generic types corresponding to a raw type. For example, suppose you have an "Id" class for Id representation/translation (i.e. an internal vs. external representation). Id type that has same serialization for all generic types.  Essentially write out the id value. Deserialization is very similar but not exactly the same.
You need to call "new Id(Class, String)" which returns an instance of Id Gson supports registering a single handler for this. You can also register a specific handler for a specific generic type (say Id needed special handling).
The Type parameter for the toJson and from Json contains the generic type information to help you write a single handler for all generic types corresponding to the same raw type