To extract textual content from a PDF, it is probably more convenient to use an API like PDFTextStream. The extraction is generally simple, though not quite a one-line solution. A code example is given below.

    /** Get content of PDF **/
    PDFTextStreamConfig configuration = new PDFTextStreamConfig();
    configuration.setTableDetectionEnabled(true);
    // Note: as written, 'configuration' is never handed to the stream, so it has
    // no effect here. To apply it, pass it to the PDFTextStream constructor
    // (check the API docs for the overload that accepts a config).
    PDFTextStream stream = new PDFTextStream(new File(filePath));
    StringBuilder text = new StringBuilder();
    OutputTarget target = new OutputTarget(text);
    stream.pipe(target);
    String content = target.getObject().toString();

Compared with a similar API such as Tika, I would say that PDFTextStream is faster and gives you more useful information. In particular, Tika extracts raw text from a PDF, while PDFTextStream gives you structured text with associated information such as character encoding, height, and the region of the text. The result is structured as Pages --> Blocks --> Lines --> TextUnits, and below is an example of how we can get the pages from a PDF.

    /** Get number of pages **/
    int pageCnt = stream.getPageCnt();
    // Page indices start at 0, so the loop must stop before pageCnt.
    for (int i = 0; i < pageCnt; i++) {
        Page page = stream.getPage(i);
        System.out.println("PAGE NR[" + i + "]:\n" + page);
    }

To get the Blocks from a Page, use the following method, which returns all Blocks of a Page:

    // Requires java.util.List, java.util.ArrayList and java.io.IOException.
    private List<Block> getBlocks(Page page) throws IOException {
        List<Block> listOfBlocks = new ArrayList<>();
        int numBlockChilds = page.getTextContent().getChildCnt();
        for (int i = 0; i < numBlockChilds; i++) {
            Block block = page.getTextContent().getChild(i);
            listOfBlocks.add(block);
        }
        return listOfBlocks;
    }

The final step to get the raw text is to extract it from a Block:

    private String block2txt(Block block) throws IOException {
        StringBuilder text = new StringBuilder();
        // Pipe the block's content into the StringBuilder wrapped by the target.
        OutputTarget target = new OutputTarget(text);
        block.pipe(target);
        String block_text = target.getObject().toString();
        return block_text;
    }

This returns the text of a Block. In the same way, you can navigate the rest of the PDF's content and extract the text from a Page or a Line.
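
To make the "navigate everything the same way" idea concrete, here is a rough, library-free sketch of walking a Pages --> Blocks --> Lines --> TextUnits tree. Note that `Node`, `PdfTreeSketch`, and `textOf` are hypothetical stand-ins I made up for illustration; they are not PDFTextStream classes, and the real API exposes its own types and `pipe` methods as shown above. The sketch only assumes that each level holds an ordered list of children and that text lives in the leaf units.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one node of the page/block/line/unit hierarchy.
// NOT part of PDFTextStream; it only mirrors the shape of the structure.
final class Node {
    final String kind;                          // "page", "block", "line" or "unit"
    final String text;                          // set only on "unit" leaves
    final List<Node> children = new ArrayList<>();

    Node(String kind, String text) {
        this.kind = kind;
        this.text = text;
    }

    // Append a child and return this node so calls can be chained.
    Node add(Node child) {
        children.add(child);
        return this;
    }
}

class PdfTreeSketch {
    // Return the text found below any node, whatever its level.
    static String textOf(Node node) {
        StringBuilder out = new StringBuilder();
        collect(node, out);
        return out.toString();
    }

    // Depth-first walk: units contribute their text, each line ends with a newline.
    private static void collect(Node node, StringBuilder out) {
        if (node.kind.equals("unit")) {
            out.append(node.text);
            return;
        }
        for (Node child : node.children) {
            collect(child, out);
        }
        if (node.kind.equals("line")) {
            out.append('\n');
        }
    }

    public static void main(String[] args) {
        Node line1 = new Node("line", null)
                .add(new Node("unit", "Hello "))
                .add(new Node("unit", "PDF"));
        Node line2 = new Node("line", null)
                .add(new Node("unit", "Second line"));
        Node block = new Node("block", null).add(line1).add(line2);
        Node page = new Node("page", null).add(block);

        System.out.print(textOf(line1)); // prints "Hello PDF" and a newline
        System.out.print(textOf(page));  // prints both lines, one per row
    }
}
```

The same `textOf` call works for a page, a block, or a single line, which is the point: once you can descend one level, extracting text at any level is the same recursive walk.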