Monday, November 18, 2013

Get TF and IDF of all the terms of an index [Lucene 4.3]



    Let's assume that you have indexed a number of documents with Lucene 4.3. The index created by Lucene is a "flat" database with a number of fields for every document. Each field contains the terms of a document and their respective frequencies in a term vector*. For those migrating from older versions of Lucene, extracting statistics like TF and IDF in Lucene 4.3 can seem a bit trickier. Newer versions of Lucene are indeed less intuitive, but on the other hand they are more flexible.
 
    First, a reader must be opened in order to access the index, along with a TFIDFSimilarity instance that will help us calculate the frequencies (tf, idf) and a HashMap that will hold the scores (tf*idf).

         
         IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
         TFIDFSimilarity tfidfSIM = new DefaultSimilarity();
         Map<String, Float> tf_Idf_Weights = new HashMap<>();
         Map<String, Float> termFrequencies = new HashMap<>();
 
    Second, in order to get the terms of every document we must iterate through the enumerations of terms and documents of every indexed field. In practice, we walk these enumerations for every document in the index:

 *Pay attention! During indexing, the term vectors must be stored.
 *The terms are stored in the index as bytes (BytesRef), not as Strings.
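
For reference, storing the term vectors at indexing time might look like the sketch below (Lucene 4.3 API; the field name "CONTENT", the analyzer and the index path are assumptions):

FieldType fieldType = new FieldType(TextField.TYPE_STORED);
fieldType.setStoreTermVectors(true); //without this, reader.getTermVector(docId, "CONTENT") returns null

Document doc = new Document();
doc.add(new Field("CONTENT", "the quick brown fox", fieldType));

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
IndexWriter writer = new IndexWriter(FSDirectory.open(new File(index)), config);
writer.addDocument(doc);
writer.close();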

Calculating the Inverse Document Frequencies: 

 Firstly, we create a Map for adding the idf values:

   
 Map<String, Float> docFrequencies = new HashMap<>();
The function below is field-specific: it walks the TermsEnum of the given field and calculates the idf of every term it finds:
 
    /*** GET ALL THE IDFs OF A FIELD ***/
    Map<String, Float> getIdfs(IndexReader reader, String field) throws IOException
    {
        /** GET THE TERMS OF THE GIVEN FIELD **/
        Terms terms = MultiFields.getTerms(reader, field); //the terms of the field, over all the documents of the index

        TFIDFSimilarity tfidfSIM = new DefaultSimilarity();

        if (terms != null)
        {
            TermsEnum termEnum = terms.iterator(null);
            BytesRef bytesRef;
            while ((bytesRef = termEnum.next()) != null)
            {
                String term = bytesRef.utf8ToString();
                float idf = tfidfSIM.idf(termEnum.docFreq(), reader.numDocs());
                docFrequencies.put(term, idf);
            }
        }

        return docFrequencies;
    }
In particular, the Lucene method that we use to get the inverse document frequency is:

   tfidfSIM.idf(termEnum.docFreq(), reader.numDocs())

It computes a score factor based on a term's document frequency (the number of documents which contain the term). This value is later multiplied by the tf factor of each term.
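
For reference, DefaultSimilarity implements idf as 1 + ln(numDocs / (docFreq + 1)) and tf as sqrt(freq), so the value returned by the call above can be sanity-checked by hand (the numbers below are only an illustration):

// Quick sanity check of the DefaultSimilarity formulas.
TFIDFSimilarity sim = new DefaultSimilarity();
float idf = sim.idf(3, 10);                                   //term appears in 3 out of 10 documents
float expected = (float) (Math.log(10 / (double) (3 + 1)) + 1.0);
System.out.println(idf + " should equal " + expected);        //both are ≈ 1.916

float tf = sim.tf(4);                                         //term occurs 4 times in a document
System.out.println(tf);                                       //sqrt(4) = 2.0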

Calculating the Term Frequencies:
 
for (int docId = 0; docId < reader.maxDoc(); docId++)
{
    Terms vector = reader.getTermVector(docId, CONTENT); //CONTENT is the name of the indexed field

    if (vector == null) //no term vector was stored for this document/field
        continue;

    TermsEnum termsEnum = vector.iterator(null);
    DocsEnum docsEnum = null;
    BytesRef bytesRef;

    while ((bytesRef = termsEnum.next()) != null)
    {
        String term = bytesRef.utf8ToString();
        float tf = 0;

        docsEnum = termsEnum.docs(null, docsEnum, DocsEnum.FLAG_FREQS);
        while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS)
        {
            tf = tfidfSIM.tf(docsEnum.freq()); //freq() = how many times the term occurs in this document
            termFrequencies.put(term, tf);
        }

        float idf = docFrequencies.get(term);
        float w = tf * idf;
        tf_Idf_Weights.put(term, w); //in practice, keep one map of weights per document
    }
}

return tf_Idf_Weights;
    Lucene uses an inverted index data structure, which means that finding the term frequencies of every document is not as direct as one might think: the index maps each term to the documents that contain it, rather than mapping each document to its terms. This makes it easy to retrieve the number of matching documents for a certain term, but in order to get the tf of a term in a specific document we iterate through the DocsEnum of its term vector and call
tf = tfidfSIM.tf(docsEnum.freq());
for every term.

  * The function freq() returns the term frequency in the current document *

    After calculating the tf and idf of every term, we can get the weight (w = tf * idf) and store it. This way we can create a vector for each document that contains the respective weights of its terms, and therefore we can calculate the distance between document vectors.

If you want to take a look at a complete java class implementation of this functionality check out this post.

Tuesday, October 1, 2013

Parsing pdf files (Tika 1.3)



  Armed with Tika, you can be confident of knowing each document's "psyche", so sorting and organizing documents will be a piece of cake. The original and most important use case for Tika is extracting textual content from digital documents for use in building a full-text search index, which requires dealing with all of the different parsing toolkits out there and representing text in a uniform way. We'll start with a simple full-text extraction example based on a concrete parser, and we'll also cover the Parser interface, which is the central abstraction for all the content extraction functionality in Tika.

 Full Text Extraction:

public String extractContentFromPDF(String filePath) throws IOException, SAXException, TikaException
 { 
  Parser parser = new PDFParser();    // PDFParser if you want to parse PDF files
  //Parser parser = new AutoDetectParser();  //Parser used for automatic detection 
  
  InputStream stream = TikaInputStream.get(new File(filePath)); 
  Metadata metadata = new Metadata();
  ParseContext context = new ParseContext();
  ContentHandler handler = new BodyContentHandler(1000000000);    //raise the default write limit (100,000 characters)
  
  ContentHandler newHandler = new WriteOutContentHandler(handler, 1000000000); //forwards the content to "handler", up to the given limit
  
  try {
   parser.parse(stream, newHandler, metadata, context);
  } finally {
   stream.close();
  }
  
  String dataString = handler.toString();
   
  return dataString;
 }

  When the Tika facade or the AutoDetectParser is used, Tika will first try to detect the given document's media type. Once the type of the document is known, a matching parser implementation is looked up. The given document is then parsed (by the selected parser) using the method:

parse( InputStream, ContentHandler, Metadata, ParseContext )

  The PDFParser class is a wrapper around the advanced PDF parsing capabilities of the Apache PDFBox library, so it passes the example document to PDFBox and converts the returned metadata and text content to a format defined by Tika.
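
If you just want the text and would rather let Tika handle the detection and the parser selection for you, the Tika facade offers a compact alternative (a minimal sketch; the file path is an assumption):

Tika tika = new Tika();
String text = tika.parseToString(new File("/path/to/document.pdf")); //throws IOException and TikaException
System.out.println(text);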

The parse() method:

  Information flows between the parse() method and its arguments. The input stream and metadata arguments are used as sources of document data, and the results of the parsing process are written out to the given content handler and metadata object. The context object is used as a source of generic context information from the client application to the parsing process. 

  • The document input stream—The raw byte stream of the document to be parsed is read from this input stream. 
  • XHTML SAX event handler—The structured text content of the input document is written to this handler as a semantic XHTML document. 
  • Document metadata—The metadata object is used both as a source and a target of document metadata. 
  • Context of the parsing process—This argument is used in cases where the client application wants to customize the parsing process by passing custom context information. 


Content Handlers:


  •  The BodyContentHandler class picks the body part of the XHTML output and redirects it to another ContentHandler instance.
  •  The LinkContentHandler class detects all <a href="..."> elements in the XHTML output and collects these hyperlinks for use by tools such as web crawlers. 
  •  The TeeContentHandler class delivers incoming SAX events to a collection of other event handlers, and can thus be used to easily process the parser output with multiple tools in parallel.
For example:

LinkContentHandler linkCollector = new LinkContentHandler();
OutputStream output = new FileOutputStream(new File(filename));
try {
    ContentHandler handler = new TeeContentHandler(
            new BodyContentHandler(output), linkCollector);
    parser.parse(stream, handler, metadata, context);
} finally {
    output.close();
}

ParseContext:

   There are a few cases where it's useful to be able to pass extra context information to a parser. A ParseContext instance is a simple container object that maps interface declarations to objects that implement those interfaces. Sometimes a client application needs more direct control over the parsing process in order to implement custom processing of specific kinds of documents.

For example:

ParseContext context = new ParseContext();
context.set(Locale.class, Locale.ENGLISH);
parser.parse(stream, handler, metadata, context);

  Some documents, like Microsoft Excel spreadsheets, contain binary data that needs to be rendered to text when output by Tika. Since spreadsheet documents usually don't specify the exact formatting or the output locale for such data, the parser needs to decide which locale to use.


Metadata:

  Last but not least, Tika can extract not just the textual content from the files we encounter, but the metadata as well. Metadata can be defined as data about data: a metadata model has attributes, relationships between those attributes, and also information about the attributes, such as their formats, cardinality, definitions and, of course, their names.
   Behind metadata management lurks a world of challenges, since in most real cases the quality of the metadata needs to be validated before it can be classified as trustworthy data and not spam.
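
After parse() returns, the Metadata object can simply be queried like a map. A minimal sketch, reusing the metadata object from the extraction example above:

// Print every metadata entry that the parser populated.
for (String name : metadata.names()) {
    System.out.println(name + " : " + metadata.get(name));
}

// Individual properties can also be read directly, e.g. the detected content type.
System.out.println("Content-Type: " + metadata.get(Metadata.CONTENT_TYPE));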

  The purpose of this post was for you to grasp the idea of the overall design and functionality of the content extraction features in Tika.



Tuesday, September 24, 2013

Implementation of TF/IDF (Term Frequency-Inverse Document Frequency) in JAVA

    
     Tf–Idf is the product of two statistics, term frequency and inverse document frequency. Various ways for determining the exact values of both statistics exist. In the case of the term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw frequency of t by f(t,d), then the simple Tf scheme is tf(t,d) = f(t,d).




The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:

\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}

where |D| is the cardinality, i.e. the total number of documents in the corpus, and |\{d \in D : t \in d\}| is the number of documents in which the term t appears.

  Then tf–idf is calculated as:

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.
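
A quick worked example (numbers chosen purely for illustration, using a base-10 logarithm): suppose a term occurs 3 times in a 100-word document and appears in 10 out of the 1000 documents of the corpus. Then

\mathrm{tf} = \frac{3}{100} = 0.03, \qquad \mathrm{idf} = \log_{10}\frac{1000}{10} = 2, \qquad \mathrm{tfidf} = 0.03 \times 2 = 0.06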


NOTE: For a deeper dive into theory read this.


Practically..

A possible implementation (written in Java) for calculating Tf/Idf of the terms of a document is given below:


public class Tf_Idf 
{
     /**
      * Calculates the tf of term termToCheck
      * @param totalterms : List of all the words of the document under processing
      * @param termToCheck : term of which tf is to be calculated.
      * @return tf (term frequency) of term termToCheck
      */
     public double tfCalculator(List<String> totalterms, String termToCheck) 
     {
         double count = 0;   //number of occurrences of termToCheck
         for (String s : totalterms) 
         {
             if (s.equalsIgnoreCase(termToCheck))
                 count++;
         }
         return count / totalterms.size();
     }
     
     /**
      * Calculates the idf of term termToCheck
      * @param allTerms : the terms of all the documents (one String[] per document)
      * @param termToCheck
      * @return idf (inverse document frequency) score
      */
     public double idfCalculator(List<String[]> allTerms, String termToCheck) 
     {
         double count = 0;   //number of documents that contain termToCheck
         for (String[] ss : allTerms)
         {
             for (String s : ss) 
             {
                 if (s.equalsIgnoreCase(termToCheck))
                 {
                     count++;
                     break;
                 }
             }
         }
         return Math.log(allTerms.size() / count);
     }   
 
}
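
A minimal usage sketch of the class above (the toy documents are made up for the example; the java.util imports are assumed):

List<String> doc1 = Arrays.asList("the", "stars", "are", "bright", "stars");
List<String> doc2 = Arrays.asList("the", "sky", "is", "dark");
List<String[]> corpus = Arrays.asList(doc1.toArray(new String[0]),
                                      doc2.toArray(new String[0]));

Tf_Idf tfIdf = new Tf_Idf();
double tf  = tfIdf.tfCalculator(doc1, "stars");    // 2 / 5 = 0.4
double idf = tfIdf.idfCalculator(corpus, "stars"); // ln(2 / 1) ≈ 0.693
System.out.println("tf-idf = " + tf * idf);        // ≈ 0.277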
*If you want to know more about tf/idf and its applications check out the Cosine similarity example here!


Implementation of Cosine Similarity [JAVA and Python Example]





Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as:

\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}

   This metric is frequently used when trying to determine the similarity between two documents. The attributes (or words, in the case of documents) are used as vectors, and we compute the normalized dot product of the two documents. By determining the cosine similarity, we are effectively finding the cosine of the angle between the two objects. A cosine similarity of 0 means the documents do not share any attributes (or words), because the angle between the objects is 90 degrees.

   Below there are two possible implementations written in Java and in Python which you might find handy if you want to develop your own projects.


Java implementation:


public class CosineSimilarity 
{
  /**
     * Method to calculate cosine similarity between two documents.
     * @param docVector1 : document vector 1 (a)
     * @param docVector2 : document vector 2 (b)
     * @return 
     */
    public double cosineSimilarity(double[] docVector1, double[] docVector2) 
    {
        double dotProduct = 0.0;
        double magnitude1 = 0.0;
        double magnitude2 = 0.0;
        double cosineSimilarity = 0.0;

        for (int i = 0; i < docVector1.length; i++) //docVector1 and docVector2 must be of same length
        {
            dotProduct += docVector1[i] * docVector2[i];  //a.b
            magnitude1 += Math.pow(docVector1[i], 2);  //(a^2)
            magnitude2 += Math.pow(docVector2[i], 2); //(b^2)
        }

        magnitude1 = Math.sqrt(magnitude1);//sqrt(a^2)
        magnitude2 = Math.sqrt(magnitude2);//sqrt(b^2)

        if (magnitude1 != 0.0 && magnitude2 != 0.0) //both magnitudes must be non-zero, otherwise return 0.0
        {
            cosineSimilarity = dotProduct / (magnitude1 * magnitude2);
        } 
        else 
        {
            return 0.0;
        }
        return cosineSimilarity;
    }
}
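
A minimal usage sketch, assuming the two arrays hold the tf-idf weights of the same terms in the same order:

double[] docVector1 = {0.0, 0.5, 0.3, 0.0}; //tf-idf weights of document 1
double[] docVector2 = {0.1, 0.4, 0.0, 0.0}; //tf-idf weights of document 2

CosineSimilarity cs = new CosineSimilarity();
System.out.println(cs.cosineSimilarity(docVector1, docVector2)); // ≈ 0.832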
    *In addition, you should know how to implement tf/idf (term frequency-inverse document frequency) for every term (word) of the document, since those are the values of the attributes A and B. You can find a mathematical explanation of tf/idf with an example written in Java here!

Python implementation:



# Input: 2 vectors (lists of numbers of equal length)
# Output: the cosine similarity
from math import sqrt

def cosine_similarity(vector1, vector2):
  # Numerator: the dot product of the two vectors
  dot = sum(vector1[i] * vector2[i] for i in range(len(vector1)))
  
  # Norm of the first vector
  norm_vector1 = sqrt(sum(v * v for v in vector1))
  
  # Norm of the second vector
  norm_vector2 = sqrt(sum(v * v for v in vector2))
  
  return dot / (norm_vector1 * norm_vector2)

Monday, September 23, 2013

The importance of being Semantic

Aristotelian Triangle


“Aristotle teaches us what names and words designate and that from one side there is mental representations (Noemata) and from the other side the process of naming and designation is being realized by the means of a designator (Subject) and an object and that one should not add any kind of intermediate element between the thought and the object.”

  In other words, it is inherent in people to conceptualize the meaning of a symbol and the object that it refers to. This process relies (to some extent) on subjective reasoning and thus may not be universal, but it is nevertheless common ground among all thinking animals. On a higher level, logic-thinking animals (like humans) may not only understand the meaning of a term but may also "rank" it when confronting it with other terms. This demonstrates the capability of logic-thinking animals to reason and deduce facts implicitly. Computers, on the other hand, may have evolved rapidly and may have strong computing capabilities, but they still lack certain aspects that would classify them as intelligent thinkers. Since computers can religiously follow only rules, the Aristotelian Triangle is of no use in this case.

  Semantics do not mean anything

  Lazy but big steps towards a semantic universe have led us to a series of techniques and theories that could potentially give us the possibility to build an actual thinking machine: computational linguistics and natural language processing, collaborative filtering, models developed using data mining and machine learning algorithms, AI, processes for filtering information or patterns, methods for making automatic predictions and decisions, and so on. All of the above techniques are directly (or indirectly) trying to resolve the problem of semantics.

Terms, Words, Phrases, Texts, CORPUS..

  The ability to correlate terms, words, phrases, or even whole texts is something that can be taught to computers in an efficient (more or less) way. Given a collection of documents of the same semantic context (a corpus), a practical way to do this is to give a semantic "weight" to each term of every single document. By measuring the "weights" of the terms and comparing them, we can achieve a classification that can be further exploited in order to measure the semantic "distance" between terms. The weighting factor is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is called term frequency–inverse document frequency (tf–idf): a value that increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. This may look like a paradox, but it actually helps to control for the fact that some words are generally more common than others, which makes them less significant and not worth taking into consideration.


PRACTICAL NOTE*
  • Term Frequency: Suppose a document contains 5,000 words in total and a given word occurs 5 times in it. Then, mathematically, its term frequency is TF = 5/5000 = 0.001.
  • Inverse Document Frequency: Suppose one bought the whole 1Q84 series, and all the books together contain 9,000,000,000,000 words, of which the word "senshei" appears 70,000 times. Then, mathematically, its inverse document frequency is IDF = log10(9000000000000/70000) ≈ 8.11.
         And finally, tf-idf = TF * IDF.


"To get to know each other.."

  The evaluation of semantic proximity is done using the cosine similarity: a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. Each term is notionally assigned a different dimension, and a document is characterized by a vector where the value of each dimension corresponds to the tf–idf of the term. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter. The formula is given below; in our case A and B are vectors containing the respective tf-idf weights of the terms of two documents:

\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}

A more mathematical  explanation of the Cosine similarity and an example written in Java and Python can be found here.

-Just imagine a future in which machines will be able to conceptualize the meaning of death.




Wednesday, September 18, 2013

Extraction of links from an HTML document + Example (Jsoup 1.7.2)


    This is an example of how we can parse an HTML document and extract all the (valid) links contained in the BODY element. The "tool" used for the manipulation and extraction is Jsoup (version 1.7.2).

    The power of this API lies in the simplicity of its use. As a matter of fact, we can efficiently extract any information in just a few lines of code. First, an HTML file is read and parsed.

File htmlFile = new File("../workspace/HtmlParserJsoup/sample.html");
Document doc = Jsoup.parse(htmlFile, "UTF-8");
    As long as the file contains parseable HTML, the output is a Document containing (at least) a head and a body element. (Watch out: the class here is org.jsoup.nodes.Document, not the org.w3c.dom one.)

    In order to extract the http links contained in the body element, we first find all the elements with the a tag, and then loop over them to check the validity of each link. The logic, like the flow of the program, is pretty straightforward.

/*** GET ALL LINKS OF BODY ***/
Element body = doc.body();
Elements links = body.getElementsByTag("a");
for (Element link : links) 
{
   String linkHref = link.attr("href");
   String linkText = link.text();
   if ((linkHref.length() > 5) && (linkHref.contains("http")))
      System.out.println("\n[DEBUG] linkHref : " + linkHref);
}
In the end you should have nothing but valid http links.

 

Parse HTML with JSOUP API


  This article is intended for those who want to manipulate and/or extract data from HTML pages.
For this purpose the jsoup API is used, which is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

 This library implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

The core functionality of this API is presented in the next few paragraphs along with some simple examples. Generally, jsoup can:
  • Scrape and parse HTML from a URL, file, or string 
Example:

File htmlFile = new File("../workspace/HtmlParserJsoup/GET_RESPONSE.html"); 
Document doc = Jsoup.parse(htmlFile, "UTF-8"); 

  • Find and extract data, using DOM traversal or CSS selectors
Example :

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]");
  // img with src ending .png
Element masthead = doc.select("div.masthead").first();
  // div with class=masthead
Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

  • Manipulate the HTML elements, attributes, and text
Example:
Element div = doc.select("div").first();   // <div></div>
div.html("<p>lorem ipsum</p>");            // <div><p>lorem ipsum</p></div>
div.prepend("<p>First</p>");
div.append("<p>Last</p>");                 // now: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

Element span = doc.select("span").first();               // <span>One</span>
span.wrap("<li><a href='http://example.com/'></a></li>");
// now: <li><a href="http://example.com/"><span>One</span></a></li>

  •  Clean user-submitted content against a safe white-list, to prevent XSS attacks
Example:
String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());

  •  output tidy HTML

UNAGI:

  Jsoup is an open source project distributed under the liberal MIT license. The source code is available at GitHub. You can download the latest version .jar here.
An example of HTML parsing can be found here.
Enjoy!   :-)


Monday, September 16, 2013

Google search engine response(JSON) structure


  This document is intended for developers who want to write applications that interact with the JSON/Atom Custom Search API. We will analyze the structure of the response received after an HTTP GET request to the Google Custom Search engine.

Response data:
{
 "kind": "customsearch#search",
 "url": {
  "type": "application/json",
  "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={...}"
 },
 "queries": {
  "nextPage": [
   {
    "title": "Google Custom Search - stars",
    "totalResults": "309000",
    "searchTerms": "stars",
    "count": 10,
    "startIndex": 11,
    "inputEncoding": "utf8",
    "outputEncoding": "utf8",
    "safe": "off",
    "cx": "002551901808508802477:39t74qehyw0"
   }
  ],
  "request": [
   {
    "title": "Google Custom Search - stars",
    "totalResults": "309000",
    "searchTerms": "stars",
    "count": 10,
    "startIndex": 1,
    "inputEncoding": "utf8",
    "outputEncoding": "utf8",
    "safe": "off",
    "cx": "002551901808508802477:39t74qehyw0"
   }
  ]
 },
 "context": {
  "title": "Customized.Searching"
 },
 "searchInformation": {
  "searchTime": 0.29926,
  "formattedSearchTime": "0.30",
  "totalResults": "309000",
  "formattedTotalResults": "309,000"
 },
 "items": [
  {
   "kind": "customsearch#result",
   "title": "Formation of the First Stars",
   "htmlTitle": "Formation of the First \u003cb\u003eStars\u003c/b\u003e",
   "link": "http://arxiv.org/abs/1305.5178",
   "displayLink": "arxiv.org",
   "snippet": "May 22, 2013 ... Abstract: Understanding the formation of the...", 
   "htmlSnippet": "May 22, 2013 \u003cb\u003e...Abstract: Unde...", 
   "cacheId": "CbjG-txlc30J",
   "formattedUrl": "arxiv.org/abs/1305.5178",
   "htmlFormattedUrl": "arxiv.org/abs/1305.5178",
   "pagemap": {
    "metatags": [
     {
      "citation_title": "Formation of the First Stars",
      "citation_author": "Bromm, Volker",
      "citation_date": "2013/05/22",
      "citation_online_date": "2013/05/22",
      "citation_pdf_url": "http://arxiv.org/pdf/1305.5178",
      "citation_arxiv_id": "1305.5178"
     }
    ]
   }
  },
  .
  .
  .
  .
  {
   "kind": "customsearch#result",
   "title": "[1309.0096] Quark deconfinement transition in neutron stars...",
   "htmlTitle": "[1309.0096] Quark deconfinement transition in neutron ...",
   "displayLink": "arxiv.org",
   "snippet": "Aug 31, 2013 ... Abstract: A phase of strong interacting ...",
   "htmlSnippet": "Aug 31, 2013 \u003cb\u003e...\u003c/b\u003e Abstract: A phase ...",
   "cacheId": "_eXZTeanAq8J",
   "formattedUrl": "arxiv.org/abs/1309.0096",
   "htmlFormattedUrl": "arxiv.org/abs/1309.0096",
   "pagemap": {
    "metatags": [
     {
      "citation_title": "Quark deconfinement transition in neutr", 
      "citation_author": "Logoteta, Domenico",
      "citation_date": "2013/08/31",
      "citation_online_date": "2013/08/31",
      "citation_pdf_url": "http://arxiv.org/pdf/1309.0096",
      "citation_arxiv_id": "1309.0096"
     }
    ]
   }
  }
 ]
}
As shown in the sample output above, the response data includes three main classes of properties:

  • Search metadata 
  • Custom search engine metadata
  • Search results
  The main classes of data are described below:

 Search metadata 

 The search metadata includes the url property, which has information about the OpenSearch template used for the results returned in this request. It also includes a queries property, which is an array of objects describing the characteristics of possible searches. The name of each object in the array is either the name of an OpenSearch query role or one of the two custom roles defined by this API: previousPage and nextPage.


 Each query role object is itself an array, though usually the array contains a single element. Possible query role objects include:

request: Metadata describing the query for the current set of results. This role is always present in the response. It is always an array with just one element.
nextPage: Metadata describing the query to use for the next page of results. This role is not present if the current results are the last page. Note: This API returns up to the first 100 results only. When present, it is always an array with just one element.
previousPage: Metadata describing the query to use for the previous page of results. Not present if the current results are the first page. When present, it is always an array with just one element.

 Note: The totalResults property in the objects above identifies the estimated total number of results for the search, which may not be accurate.

 Custom search engine metadata 

 The context property has metadata describing the search engine that performed the search query. It includes the name of the search engine, and any facet objects it provides for refining a search.

 Search results 

 The items array contains the actual search results. The search results include the URL, title and text snippets that describe the result. In addition, they can contain pagemap information, if applicable. If the search results include a promotions property, it contains a set of promotions.





Convert Java Object From JSON (Gson API)



  In this article, we show an example of how to use Gson, a JSON library, to convert JSON into Java objects. The Gson API is easy to learn and use; all we need to know are the following two methods:


  •  toJson() – Convert Java object to JSON format 
  •  fromJson() – Convert JSON into Java object 
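
A minimal round trip with these two methods might look like this (the Point class is a made-up example):

// Hypothetical POJO used only for this illustration.
class Point {
    int x;
    int y;
}

Gson gson = new Gson();
String json = gson.toJson(new Point());         // Java object -> JSON, e.g. {"x":0,"y":0}
Point point = gson.fromJson(json, Point.class); // JSON -> Java object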


 We suppose that an HTTP GET request is made to the Google Custom Search engine. The response from Google is in JSON format, and part of this response is presented here:
 
 .
 .
 .
"items": [
  {
   "kind": "customsearch#result",
   "title": "Formation of the First Stars",
   "htmlTitle": "Formation of the First \u003cb\",
   "link": "http://arxiv.org/abs/1305.5178",
   "displayLink": "arxiv.org",
   "snippet": "May 22, 2013 ... Abstract: Understanding the formation of ...",
   "htmlSnippet": "May 22, 2013 \u003cb\u003e...",
   "cacheId": "CbjG-txlc30J",
   "formattedUrl": "arxiv.org/abs/1305.5178",
   "htmlFormattedUrl": "arxiv.org/abs/1305.5178",
   "pagemap": {
    "metatags": [
     {
      "citation_title": "Formation of the First Stars",
      "citation_author": "Bromm, Volker",
      "citation_date": "2013/05/22",
      "citation_online_date": "2013/05/22",
      "citation_pdf_url": "http://arxiv.org/pdf/1305.5178",
      "citation_arxiv_id": "1305.5178"
     }
    ]
   }
  },
  {
   "kind": "customsearch#result",
   "title": "Fast radio bursts: the last sign of supramassive neutron stars",
   "htmlTitle": "Fast radio bursts: the last sign of supramassive neutron",
   "link": "http://arxiv.org/abs/1307.1409",
   "displayLink": "arxiv.org",
   "snippet": "Jul 4, 2013 ... Abstract: Several fast radio bursts , ...",
   "htmlSnippet": "Jul 4, 2013 \u003cb\u003e...\u003c/b\u003e Abstract:....",
   "cacheId": "kcClCUGrI9IJ",
   "formattedUrl": "arxiv.org/abs/1307.1409",
   "htmlFormattedUrl": "arxiv.org/abs/1307.1409",
   "pagemap": {
    "metatags": [
     {
      "citation_title": "Fast radio bursts: the last sign of supramassive..",
      "citation_author": "Falcke, Heino",
      "citation_date": "2013/07/04",
      "citation_online_date": "2013/07/04",
      "citation_pdf_url": "http://arxiv.org/pdf/1307.1409",
      "citation_arxiv_id": "1307.1409"
     }
    ]
   }
  },……]
 .
 .
 .
Generally the response data includes three main classes of properties:

 • Search metadata
 • Custom search engine metadata
 • Search results

 As shown in the sample output above, the items array contains the actual search results. The search results include the URL, title and text snippets that describe each result.
*In addition, they can contain pagemap information, if applicable. The metatags array contains the available metadata that come with the search result. A sample of such metadata is given as an example below:

JSON METATAGS SAMPLE
             

      .
      .
   citation_language
   prism.publicationname: "Icarus"
   prism.issn: "0019-1035"
   prism.volume: "216"
   prism.startingpage: "136"
   prism.endingpage: "140"
   dc.language: "en"
   dc.identifier: "doi:10.1016/j.icarus.2011.08.022"
   dc.date: "2011-11"
   dc.source: "Icarus, Volume 216, Issue 1, p. 136-140."
   eprints.creators_name: "Barnes, Jason W."
   eprints.creators_id: "Barnes-J-W"
   eprints.type: "article"
   eprints.datestamp: "2011-12-19 19:04:30"
   eprints.lastmod: "2011-12-19 19:04:30"
   eprints.metadata_visibility: "show"
   eprints.title: "Organic sedimentary deposits in Titan’s dry lakebeds: Probable evaporite"
   eprints.ispublished: "pub"
   eprints.full_text_status: "none"
   eprints.keywords: "Titan"
   eprints.date: "2011-11"
   eprints.date_type: "published"
   eprints.publication: "Icarus"
   eprints.volume: "216"
   eprints.number: "1"
   eprints.publisher: "Elsevier"
     .
     .
  
JAVA CLASS:

  The Java class below reads the data from response.json, then "maps" it back to a Java object and displays it.
 *Note: Dependencies are omitted.
 
public class CustomSearch {
 
 static private String title = "";
 static private String htmlTitle = "";
 static private String link = "";
 static private String formattedUrl = "";
 static private String htmlFormattedUrl = "";
 static private List<String> metatagList = new ArrayList<>();
 

 
 public static void main(String[] args) throws IOException
 {
    
     File jsonFile = new File("C:\\..\\response.json");
     Gson gson = new Gson();
     Response response = gson.fromJson(new FileReader(jsonFile), Response.class);
 
     for (Item item : response.items)
     {      
       System.out.println("[DEBUG] title: " + item.title);
       System.out.println("[DEBUG] htmlTitle: " + item.htmlTitle);
       System.out.println("[DEBUG] link: " + item.link);
       System.out.println("[DEBUG] formattedUrl: " + item.formattedUrl);
       System.out.println("[DEBUG] htmlFormattedUrl: " + item.htmlFormattedUrl);
     
       // Assign values
       title = item.title;
       htmlTitle = item.htmlTitle;
       link = item.link;
       formattedUrl = item.formattedUrl;
       htmlFormattedUrl = item.htmlFormattedUrl;
 
       Pagemap pagemap = item.pagemap;
       Iterator<Metatag> iterMetatags;
     
       if(pagemap != null)
       {
        iterMetatags = pagemap.metatags.listIterator();
     
        while(iterMetatags.hasNext())
        {
         Metatag element = iterMetatags.next();

         System.out.println("[DEBUG] Metatag citation_doi :" + element.citation_doi);
         System.out.println("[DEBUG] Metatag citation_arxiv_id :" + element.citation_arxiv_id);
         System.out.println("[DEBUG] Metatag citation_author :" + element.citation_author);
         System.out.println("[DEBUG] Metatag citation_date :" + element.citation_date);
         System.out.println("[DEBUG] Metatag citation_title :" + element.citation_title);
         System.out.println("[DEBUG] Metatag citation_pdf_url :" + element.citation_pdf_url);
       
         System.out.println("[DEBUG] Metatag title :" + element.title);
         System.out.println("[DEBUG] Metatag creator :" + element.creator);
         System.out.println("[DEBUG] Metatag creationdate :" + element.creationdate);
         System.out.println("[DEBUG] Metatag moddate :" + element.moddate);
         System.out.println("[DEBUG] Metatag producer :" + element.producer);
         System.out.println("[DEBUG] Metatag company :" + element.company);
       
         System.out.println("[DEBUG] Metatag dc_title :" + element.dc_title);
         System.out.println("[DEBUG] Metatag dc_description :" + element.dc_description);
         System.out.println("[DEBUG] Metatag dc_creator :" + element.dc_creator);
       
         /* Add metadata values */
         metatagList.add(element.citation_doi);
         metatagList.add(element.citation_arxiv_id);
         metatagList.add(element.citation_author);
         metatagList.add(element.citation_date);
         metatagList.add(element.citation_title);
         metatagList.add(element.citation_pdf_url);
       
         metatagList.add(element.title);
         metatagList.add(element.creator);
         metatagList.add(element.creationdate);
         metatagList.add(element.moddate);
         metatagList.add(element.producer);
         metatagList.add(element.company);
       
         metatagList.add(element.dc_title);
         metatagList.add(element.dc_description);
         metatagList.add(element.dc_creator);
        }     
     
       }
     
     
     }

 
 }
 
 //Mapping Json response to Java Object  
 
 static class Response
 {
  List<Item> items;
 }
 
 static class Item
 {
  String title;
  String htmlTitle;
  String link;
  String formattedUrl;
  String htmlFormattedUrl;
  Pagemap pagemap;
 }
 static class Pagemap
 {
  List<Metatag> metatags;
 }
 static class Metatag
 {
  String citation_doi;
  String citation_title;
  String citation_author;
  String citation_date;
  String citation_pdf_url;
  String citation_arxiv_id;
  
  String title;
  String creator;
  String creationdate;
  String moddate;
  String producer;
  String company;
  
  @SerializedName("dc.title")
  String dc_title;
  @SerializedName("dc.description")
  String dc_description;
  @SerializedName("dc.creator")
  String dc_creator;
  
 }
 
}
Note that the @SerializedName annotation is a field-renaming technique: it maps JSON member names that are not valid Java identifiers (e.g. dc.creator) to Java field names.

Custom JSON Serialization and Deserialization  using GSON API

  Sometimes default representation is not what you want. This is often the case when dealing with library classes (DateTime, etc). Gson allows you to register your own custom serializers and deserializers. This is done by defining two parts:

  • Json Serializers: Needed to define custom serialization for an object 
  • Json Deserializers: Needed to define custom deserialization for a type 
  • Instance Creators: Not needed if a no-args constructor is available or a deserializer is registered


 
GsonBuilder gson = new GsonBuilder();
gson.registerTypeAdapter(MyType2.class, new MyTypeAdapter());
gson.registerTypeAdapter(MyType.class, new MySerializer());
gson.registerTypeAdapter(MyType.class, new MyDeserializer());
gson.registerTypeAdapter(MyType.class, new MyInstanceCreator());

The registerTypeAdapter call checks whether the type adapter implements more than one of these interfaces and registers it for all of them.

Writing a Serializer:

Here is an example of how to write a custom serializer for the JodaTime DateTime class.

 
private class DateTimeSerializer implements JsonSerializer<DateTime> {
  public JsonElement serialize(DateTime src, Type typeOfSrc, JsonSerializationContext context) {
    return new JsonPrimitive(src.toString());
  }
}

 Gson invokes this serializer whenever it runs into a DateTime object during serialization.

Writing a Deserializer:

Here is an example of how to write a custom deserializer for the JodaTime DateTime class.
 
private class DateTimeDeserializer implements JsonDeserializer<DateTime> {
  public DateTime deserialize(JsonElement json, Type typeOfT, JsonDeserializationContext context)
      throws JsonParseException {
    return new DateTime(json.getAsJsonPrimitive().getAsString());
  }
}

Gson invokes this deserializer whenever it needs to turn a JSON string fragment into a DateTime object.

Finer points with Serializers and Deserializers:

 Often you want to register a single handler for all generic types corresponding to a raw type. For example, suppose you have an "Id" class for Id representation/translation (i.e. an internal vs. external representation), and the Id type has the same serialization for all of its generic types: essentially, write out the id value. Deserialization is very similar, but not exactly the same: you need to call "new Id(Class, String)", which returns an instance of Id. Gson supports registering a single handler for this, and you can also register a specific handler for a specific generic type (say a particular generic type of Id needed special handling).
  The Type parameter of toJson() and fromJson() contains the generic type information, to help you write a single handler for all the generic types corresponding to the same raw type.
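
As a rough sketch of the "single handler" idea, assuming a hypothetical generic Id<T> class whose getValue() method returns the id as a String:

// Hedged sketch: one serializer for every Id<T>, writing out only the id value.
// The Id class and its getValue() method are assumptions made for this illustration.
class IdSerializer implements JsonSerializer<Id<?>> {
  public JsonElement serialize(Id<?> id, Type typeOfId, JsonSerializationContext context) {
    return new JsonPrimitive(id.getValue()); // the same JSON form for all generic variants of Id
  }
}

// Registering it against the raw Id class covers all of its generic variants:
// gson.registerTypeAdapter(Id.class, new IdSerializer());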