Monday, November 24, 2014

About PDFBox (current version: 1.8.6)




Brief description:
Apache PDFBox is an open source Java library for working with PDF documents.

Functionalities:
  • Extract text
    PDFBox can quickly and accurately extract text from a variety of PDF documents. This functionality is encapsulated in the org.apache.pdfbox.util.PDFTextStripper class.

    Example:

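    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PDFReader {
        public static void main(String[] args) {
            File file = new File("C:/my.pdf");
            PDDocument pdDoc = null;
            try {
                // Parse the file into a low-level COS document
                PDFParser parser = new PDFParser(new FileInputStream(file));
                parser.parse();
                COSDocument cosDoc = parser.getDocument();
                // Wrap it in the high-level document model and strip the text
                pdDoc = new PDDocument(cosDoc);
                PDFTextStripper pdfStripper = new PDFTextStripper();
                String parsedText = pdfStripper.getText(pdDoc);
                System.out.println(parsedText);
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // Always close the document to release its resources
                if (pdDoc != null) {
                    try { pdDoc.close(); } catch (IOException e) { e.printStackTrace(); }
                }
            }
        }
    }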
  • Extract pages
    Text can be extracted from a single page or from a range of pages of a PDF by setting the start and end page on the stripper.

    Example:

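       // "file" is a java.io.File pointing at the PDF, as in the previous example
       PDDocument pdDoc = null;
       try {
           PDFParser parser = new PDFParser(new FileInputStream(file));
           parser.parse();
           pdDoc = new PDDocument(parser.getDocument());
           PDFTextStripper pdfStripper = new PDFTextStripper();
           // Page numbers are 1-based and the range is inclusive
           pdfStripper.setStartPage(1);
           pdfStripper.setEndPage(5);
           String parsedText = pdfStripper.getText(pdDoc);
           System.out.println(parsedText);
       } catch (IOException e) {
           e.printStackTrace();
       } finally {
           if (pdDoc != null) {
               try { pdDoc.close(); } catch (IOException ignored) { }
           }
       }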
  • Extract height, width / Extract font / Extract bold, italic
    To extract font-related information you have to “dig” into the API and override certain methods of the text stripper.
    Help from Stack Overflow
  • Extract tables
    Extracting tabular information is not implemented, so any table detection has to be written by hand.
    (The wheel has to be reinvented.)
    Related post on Stack Overflow
  • Extract regions of text / Extract position of characters
    You can extract text by area in PDFBox. The problem is getting the coordinates in the first place. A solution is to extend the normal PDFTextStripper, override processTextPosition(TextPosition text), print out the coordinates of each character, and work out where in the document they are.

    Example:

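    // page is a PDPage, here the first page of an already loaded PDDocument (pdDoc)
    PDPage page = (PDPage) pdDoc.getDocumentCatalog().getAllPages().get(0);
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition(true);
    // java.awt.Rectangle arguments: x, y, width, height
    Rectangle rect = new Rectangle(500, 100, 55, 5);
    stripper.addRegion("class1", rect);
    stripper.extractRegions(page);
    String string = stripper.getTextForRegion("class1");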
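
    Since table extraction has to be built by hand, here is a rough sketch of one way to do it once character positions are known (for instance, collected from the TextPosition objects seen in processTextPosition). It is plain Java and makes no PDFBox calls. The Glyph class, the toRows method, the two thresholds and the coordinates in main are all made up for illustration, and the approach is only an assumption about how one might start: group characters into rows by similar y, then split each row into cells wherever the horizontal gap is large.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch: rebuild table rows from positioned characters (plain Java, no PDFBox calls). */
final class TableSketch {

    /** Hypothetical holder for one character and its position on the page. */
    static final class Glyph {
        final String text;
        final float x, y, width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Groups glyphs into rows (y within rowTolerance of the row's first glyph)
     * and splits each row into cells where the gap between neighbours exceeds cellGap.
     */
    static List<List<String>> toRows(List<Glyph> glyphs, float rowTolerance, float cellGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        // Pass 1: bucket glyphs into lines by vertical position
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last == null || Math.abs(g.y - last.get(0).y) > rowTolerance) {
                last = new ArrayList<>();
                lines.add(last);
            }
            last.add(g);
        }

        // Pass 2: within each line, order left to right and cut cells at wide gaps
        List<List<String>> rows = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            List<String> cells = new ArrayList<>();
            StringBuilder cell = new StringBuilder();
            Glyph prev = null;
            for (Glyph g : line) {
                if (prev != null && g.x - (prev.x + prev.width) > cellGap) {
                    cells.add(cell.toString());
                    cell.setLength(0);
                }
                cell.append(g.text);
                prev = g;
            }
            cells.add(cell.toString());
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up coordinates: two rows, two columns
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("A", 10, 100, 6));
        glyphs.add(new Glyph("b", 16, 100, 6));
        glyphs.add(new Glyph("1", 80, 100.4f, 6));
        glyphs.add(new Glyph("C", 10, 120, 6));
        glyphs.add(new Glyph("2", 80, 120, 6));
        System.out.println(toRows(glyphs, 2f, 10f)); // prints [[Ab, 1], [C, 2]]
    }
}
```

    Real tables will need more than this (merged cells, wrapped lines, ruling lines), and both thresholds have to be tuned per document, but it shows why the character coordinates are the starting point.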

Note: The documentation is hard to navigate and resources are scarce. This is mainly because PDFBox is a low-level API.
