Monday, March 17, 2014

Extract Text from PDF [PDFTextStream API]




    To extract the textual content of a PDF, it is probably most convenient to use an API such as PDFTextStream. The extraction is generally simple, though not exactly a one-line solution. A code example is given below.

    /** Get the content of a PDF **/
    // Note: as written, this configuration object is never passed to the
    // stream created below.
    PDFTextStreamConfig configuration = new PDFTextStreamConfig();
    configuration.setTableDetectionEnabled(true);
    PDFTextStream stream = new PDFTextStream(new File(filePath));
    // Collect everything the stream emits into a StringBuilder
    StringBuilder text = new StringBuilder();
    OutputTarget target = new OutputTarget(text);
    stream.pipe(target);
    String content = target.getObject().toString();

    Compared with a similar API such as Tika, I have found PDFTextStream to be faster and to return more useful information. In particular, Tika extracts the raw text of a PDF, while PDFTextStream returns structured text with associated information such as character encoding, height, and the region of the text. The result is structured as Pages --> Blocks --> Lines --> TextUnits, and below is an example of how to get the pages from a PDF.
 
    /** Get the number of pages and print each page **/
    int pageCnt = stream.getPageCnt();
    // i < pageCnt: starting at 0, the loop visits exactly pageCnt pages
    for (int i = 0; i < pageCnt; i++) {
        Page page = stream.getPage(i);
        System.out.println("PAGE NR[" + i + "]:\n" + page);
    }

    To get the Blocks of a Page, iterate over the children of the page's text content. The following method returns them as a list:
 
    private List<Block> getBlocks(Page page) throws IOException {
        List<Block> listOfBlocks = new ArrayList<Block>();
        int numBlockChilds = page.getTextContent().getChildCnt();

        // Each child of the page's text content is one Block
        for (int i = 0; i < numBlockChilds; i++) {
            Block block = page.getTextContent().getChild(i);
            listOfBlocks.add(block);
        }

        return listOfBlocks;
    }

    The final step in order to get raw textual information is to extract the text from the block:

 
    private String block2txt(Block block) throws IOException {
        StringBuilder text = new StringBuilder();
        OutputTarget target = new OutputTarget(text);
        block.pipe(target);
        String block_text = target.getObject().toString();
        return block_text;
    }

    This returns the text of a Block. In the same way we can navigate the entire content of the PDF and extract the text of a Page or a Line.
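
    To make the Pages --> Blocks --> Lines --> TextUnits navigation concrete without needing the library on the classpath, here is a minimal sketch of the same top-down walk. The classes Page, Block and Line below, and the methods lineToText, blockToText and pageToText, are hypothetical stand-ins invented for this illustration. They are not the PDFTextStream classes and only mimic the nesting described above, with each text unit modelled as a plain String.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HierarchySketch {

    /** Hypothetical stand-in: a line is an ordered list of text units (words). */
    static final class Line {
        final List<String> textUnits;
        Line(String... textUnits) { this.textUnits = Arrays.asList(textUnits); }
    }

    /** Hypothetical stand-in: a block is an ordered list of lines. */
    static final class Block {
        final List<Line> lines;
        Block(Line... lines) { this.lines = Arrays.asList(lines); }
    }

    /** Hypothetical stand-in: a page is an ordered list of blocks. */
    static final class Page {
        final List<Block> blocks;
        Page(Block... blocks) { this.blocks = Arrays.asList(blocks); }
    }

    /** Joins the text units of one line with single spaces. */
    static String lineToText(Line line) {
        return String.join(" ", line.textUnits);
    }

    /** Emits one output line per Line in the block. */
    static String blockToText(Block block) {
        List<String> parts = new ArrayList<>();
        for (Line line : block.lines) {
            parts.add(lineToText(line));
        }
        return String.join("\n", parts);
    }

    /** Emits every block of the page, separated by a blank line. */
    static String pageToText(Page page) {
        List<String> parts = new ArrayList<>();
        for (Block block : page.blocks) {
            parts.add(blockToText(block));
        }
        return String.join("\n\n", parts);
    }

    public static void main(String[] args) {
        Page page = new Page(
            new Block(new Line("Extract", "Text"), new Line("from", "PDF")),
            new Block(new Line("Second", "block")));
        System.out.println(pageToText(page));
    }
}
```

    Each level only knows how to render its direct children, which is the same idea as calling block2txt once per Block returned by getBlocks. With the real library, the text at each level would come from piping that object into an OutputTarget rather than from joining strings.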

1 comment:

  1. Hi,

    It is very useful.

    Any idea how to get the text formatting details, such as font size, bold, italic, style, etc.?

    Thanks,
