Monday, November 24, 2014

About PDFBox (current version: 1.8.6)

Brief description:
The Apache PDFBox library is an open source Java tool for working with PDF documents.

  • Extract text
    PDFBox is able to quickly and accurately extract text from a variety of PDF documents. This functionality is encapsulated in the org.apache.pdfbox.util.PDFTextStripper.


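    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PDFReader {
        public static void main(String[] args) {
            File file = new File("C:/my.pdf");
            try {
                PDFParser parser = new PDFParser(new FileInputStream(file));
                parser.parse();  // the file must be parsed before getDocument() is called
                PDDocument pdDoc = new PDDocument(parser.getDocument());
                PDFTextStripper pdfStripper = new PDFTextStripper();
                String parsedText = pdfStripper.getText(pdDoc);
                System.out.println(parsedText);
                pdDoc.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }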
  • Extract pages
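    A single page or a range of pages can be extracted from a PDF. Set the first and last page on the stripper before calling getText.


       try {
           PDFParser parser = new PDFParser(new FileInputStream(file));
           parser.parse();
           PDDocument pageDoc = new PDDocument(parser.getDocument());
           PDFTextStripper pageStripper = new PDFTextStripper();
           pageStripper.setStartPage(2);  // first page to extract (1-based); example value
           pageStripper.setEndPage(4);    // last page to extract (inclusive); example value
           String parsedText = pageStripper.getText(pageDoc);
           pageDoc.close();
       } catch (IOException e) {
           e.printStackTrace();
       }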
  • Extract height, width / Extract font / Extract bold, italic
    To extract font-related information, one has to “dig” into the API and override certain methods.
    Help from Stack Overflow
  • Extract tables
    PDFBox does not implement extraction of tabular information.
    (The wheel has to be reinvented.)
    Related post on Stack Overflow
  • Extract regions of text / Extract position of characters
    You can extract text by area in PDFBox. The problem is getting the coordinates in the first place. A solution is to extend the normal PDFTextStripper, override processTextPosition(TextPosition text), print out the coordinates of each character, and work out from those where in the document the text of interest sits.


    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition( true );
    Rectangle rect = new Rectangle( 500, 100, 55, 5);
    stripper.addRegion( "class1", rect );
    stripper.extractRegions( page );
    String string = stripper.getTextForRegion( "class1" );
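
The "figuring out where they are" step can be sketched without PDFBox itself. Assuming the per-character coordinates have already been logged (for example from an overridden processTextPosition), the boxes of the characters of interest can be merged into one rectangle. Everything named below (RegionFinder, Glyph, boundsOf, the sample coordinates) is hypothetical and only illustrates the idea; it also assumes one logged entry per character.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns logged character boxes into one region rectangle. */
final class RegionFinder {

    /** One character and its box, as it might be logged from processTextPosition. */
    static final class Glyph {
        final String text;
        final Rectangle2D box;

        Glyph(String text, double x, double y, double width, double height) {
            this.text = text;
            this.box = new Rectangle2D.Double(x, y, width, height);
        }
    }

    /** Smallest rectangle covering the first occurrence of word, or null if absent. */
    static Rectangle2D boundsOf(List<Glyph> glyphs, String word) {
        // Assumption: each glyph holds exactly one character, so string
        // indices and list indices line up.
        StringBuilder all = new StringBuilder();
        for (Glyph g : glyphs) {
            all.append(g.text);
        }
        int start = all.indexOf(word);
        if (start < 0 || word.isEmpty()) {
            return null;
        }
        Rectangle2D bounds = new Rectangle2D.Double();
        bounds.setRect(glyphs.get(start).box);
        for (int i = start + 1; i < start + word.length(); i++) {
            bounds.add(glyphs.get(i).box);  // grows bounds to the union of both boxes
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Made-up coordinates standing in for logged output.
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("T", 500, 100, 6, 5));
        glyphs.add(new Glyph("o", 506, 100, 5, 5));
        glyphs.add(new Glyph("t", 511, 100, 5, 5));
        glyphs.add(new Glyph("a", 516, 100, 6, 5));
        glyphs.add(new Glyph("l", 522, 100, 3, 5));
        System.out.println(boundsOf(glyphs, "Total"));
    }
}
```

The resulting Rectangle2D can then be handed to addRegion in place of the hard-coded rectangle.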

Note: The documentation is disorganized and resources are scarce, mainly because PDFBox is a low-level API.

Thursday, July 10, 2014

View in GWT using the UiBinder technique.

This is an implementation of building a View in GWT using the UiBinder technique. In part 1 you can find the basic setup for this project. 


SearchView (Interface)
package helloworld.client.view;

import com.google.gwt.user.client.ui.Widget;
import helloworld.client.presenter.SearchPresenter;

public interface SearchView {
 public interface Presenter { }
 void setPresenter(SearchPresenter presenter);
 Widget asWidget();
}

SearchPresenter (Class)

package helloworld.client.presenter;

import com.google.gwt.user.client.ui.HasWidgets;
import helloworld.client.view.SearchView;

public class SearchPresenter implements Presenter, SearchView.Presenter {

    private final SearchView view;

    /** Constructor **/
    public SearchPresenter(SearchView view) {
        this.view = view;
    }

    // Method bodies below follow the usual MVP wiring and are an assumption.
    public final void bind() {
        view.setPresenter(this);
    }

    /** SET VIEW **/
    public final void go(final HasWidgets container) {
        bind();
        container.clear();
        container.add(view.asWidget());
    }
}

*The "owner" class is instantiated inside the function onModuleLoad().
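
The order of calls between view and presenter can be tried out without GWT. The sketch below is plain Java under stated assumptions: MvpSketch, View, Presenter and FakeView are hypothetical stand-ins, a String plays the role of the Widget, and a List<String> plays the role of the HasWidgets container. It shows one common way bind() and go(...) are filled in for this pattern, not necessarily the exact bodies used in this project.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, GWT-free illustration of the view/presenter wiring. */
final class MvpSketch {

    /** Stand-in for SearchView; asWidget() returns a String instead of a Widget. */
    interface View {
        void setPresenter(Presenter presenter);
        String asWidget();
    }

    /** Stand-in for SearchPresenter. */
    static final class Presenter {
        private final View view;

        Presenter(View view) {
            this.view = view;
        }

        /** Gives the view a reference back to this presenter. */
        void bind() {
            view.setPresenter(this);
        }

        /** Binds, then replaces the container's content with the view. */
        void go(List<String> container) {
            bind();
            container.clear();
            container.add(view.asWidget());
        }
    }

    /** Stand-in for the "owner" class that would be backed by UiBinder. */
    static final class FakeView implements View {
        Presenter presenter;

        @Override
        public void setPresenter(Presenter presenter) {
            this.presenter = presenter;
        }

        @Override
        public String asWidget() {
            return "search-view";
        }
    }

    public static void main(String[] args) {
        FakeView view = new FakeView();
        Presenter presenter = new Presenter(view);
        List<String> container = new ArrayList<>();
        container.add("old-content");
        presenter.go(container);
        System.out.println(container);
        System.out.println(view.presenter == presenter);
    }
}
```

Keeping the view behind an interface is what makes a fake like this possible, so the presenter can be exercised in an ordinary JVM test.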

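SearchViewImpl ("Owner" Class)
package helloworld.client.view;

// HINT: press Ctrl+Shift+O in Eclipse to organize imports automatically.
import com.google.gwt.core.client.GWT;
import com.google.gwt.uibinder.client.UiBinder;
import com.google.gwt.uibinder.client.UiField;
import com.google.gwt.user.client.ui.Button;
import com.google.gwt.user.client.ui.Composite;
import com.google.gwt.user.client.ui.TextBox;
import com.google.gwt.user.client.ui.Widget;
import helloworld.client.presenter.SearchPresenter;

public class SearchViewImpl extends Composite implements SearchView {

 @UiField TextBox searchBox;
 @UiField Button buttonSubmit;
 private static SearchViewUiBinder uiBinder = GWT.create(SearchViewUiBinder.class);

 interface SearchViewUiBinder extends UiBinder<Widget, SearchViewImpl> { }

 public SearchViewImpl() {
  // Build the widget tree from the ui.xml template and fill the @UiField members.
  initWidget(uiBinder.createAndBindUi(this));
 }

 /* Connect with Presenter */

 private SearchPresenter presenter;

 public void setPresenter(SearchPresenter presenter) {
  this.presenter = presenter;
 }
}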

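A widget declared in the template becomes accessible from Java by giving it the attribute ui:field="name_Of_Field_to_manipulate_with_Java" and declaring a @UiField member with the same name in the owner class. Below is the XML file used by UiBinder to render the page of the web application.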