Tuesday, October 1, 2013

Parsing pdf files (Tika 1.3)

  Armed with Tika, you can be confident of knowing each document’s "psyche", so sorting and organizing documents will be a piece of cake. The original and most important use case for Tika is extracting textual content from digital documents for use in building a full-text search index—which requires dealing with all of the different parsing toolkits out there—and representing text in a uniform way. We’ll start with a simple full-text extraction and indexing example based on the Tika facade, and we’ll also proceed to cover the Parser interface that’s the central abstraction for all the content extraction functionality in Tika.
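As a warm-up, the Tika facade mentioned above offers a one-line entry point that hides the parser machinery entirely. A minimal sketch (the file path is a placeholder):

Tika tika = new Tika();
// Detects the media type and extracts the plain text in one call
String text = tika.parseToString(new File("document.pdf"));
System.out.println(text);

Under the hood this does exactly what the rest of this post walks through by hand: type detection, parser lookup, and text extraction.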

 Full Text Extraction:

public String extractContentFromPDF(String filePath)
        throws IOException, SAXException, TikaException {
  Parser parser = new PDFParser();           // PDFParser if you want to parse PDF files
  //Parser parser = new AutoDetectParser();  // parser used for automatic type detection
  InputStream stream = TikaInputStream.get(new File(filePath));
  Metadata metadata = new Metadata();
  ParseContext context = new ParseContext();
  ContentHandler handler = new BodyContentHandler(1000000000);
  // acts like a "buffer" and raises the default character write limit
  ContentHandler newHandler = new WriteOutContentHandler(handler, 1000000000);
  try {
    parser.parse(stream, newHandler, metadata, context);
  } finally {
    stream.close();
  }
  return handler.toString();
}

  First the Tika facade will try to detect the given document’s media type. Once the type of the document is known, a matching parser implementation is looked up. The given document is then parsed (by the selected parser) using the function:

parse( InputStream, ContentHandler, Metadata, ParseContext )

  The PDFParser class is a wrapper around the PDF parsing capabilities of the Apache PDFBox library: it passes the document to PDFBox and converts the returned metadata and text content to the format defined by Tika.

The parse() method:

  Information flows between the parse() method and its arguments. The input stream and metadata arguments are used as sources of document data, and the results of the parsing process are written out to the given content handler and metadata object. The context object is used as a source of generic context information from the client application to the parsing process. 

  • The document input stream—The raw byte stream of the document to be parsed is read from this input stream. 
  • XHTML SAX event handler—The structured text content of the input document is written to this handler as a semantic XHTML document. 
  • Document metadata—The metadata object is used both as a source and a target of document metadata. 
  • Context of the parsing process—This argument is used in cases where the client application wants to customize the parsing process by passing custom context information. 
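The dual role of the metadata argument deserves a concrete illustration. In this sketch (the file name is a placeholder), a resource-name hint is set before parsing to help type detection, and the detected content type is read back afterwards:

Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
// Source: a hint that helps the parser detect the media type
metadata.set(Metadata.RESOURCE_NAME_KEY, "report.pdf");
InputStream stream = TikaInputStream.get(new File("report.pdf"));
try {
  parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
} finally {
  stream.close();
}
// Target: the parser has filled in what it learned about the document
System.out.println(metadata.get(Metadata.CONTENT_TYPE));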

Content Handlers:

  •  The BodyContentHandler class picks the body part of the XHTML output and redirects it to another ContentHandler instance.
  •  The LinkContentHandler class detects all <a href="..."> elements in the XHTML output and collects these hyperlinks for use by tools such as web crawlers. 
  •  The TeeContentHandler class delivers incoming SAX events to a collection of other event handlers, and can thus be used to easily process the parser output with multiple tools in parallel.
For example:

LinkContentHandler linkCollector = new LinkContentHandler();
OutputStream output = new FileOutputStream(new File(filename));
try {
  ContentHandler handler = new TeeContentHandler(
      new BodyContentHandler(output), linkCollector);
  parser.parse(stream, handler, metadata, context);
} finally {
  output.close();
}

   Sometimes a client application needs more direct control over the parsing process in order to implement custom processing for specific kinds of documents. In such cases it's useful to be able to pass extra context information to a parser, and that is what the ParseContext argument is for: a ParseContext instance is a simple container object that maps interface declarations to objects that implement those interfaces.

For example:

ParseContext context = new ParseContext();
context.set(Locale.class, Locale.ENGLISH);
parser.parse(stream, handler, metadata, context);
 
  Some documents, such as Microsoft Excel spreadsheets, contain binary data that needs to be rendered to text when output by Tika. Since spreadsheet documents usually don't specify the exact formatting or the output locale for such data, the parser needs to decide which locale to use, and the ParseContext above is how the client application tells it.


  Last but not least, Tika can extract not just the textual content of the files it encounters, but the metadata as well. Metadata can be defined as data about data: a metadata model has attributes, relationships between those attributes, and information about the attributes themselves, such as their formats, cardinality, definitions, and names.
   Behind metadata management lurks a world of challenges, since in most real cases metadata needs to be validated before it can be classified as trustworthy data rather than spam.
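As a sketch of what metadata extraction looks like in practice (the file path is a placeholder), one can parse a document and then iterate over whatever attributes the parser has collected:

Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
InputStream stream = TikaInputStream.get(new File("document.pdf"));
try {
  parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
} finally {
  stream.close();
}
// Print every metadata attribute the parser reported (names vary by format)
for (String name : metadata.names()) {
  System.out.println(name + ": " + metadata.get(name));
}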

  The purpose of this post was for you to grasp the idea of the overall design and functionality of the content extraction features in Tika.
