Monday, November 24, 2014

About PDFBox (current version: 1.8.6)




Brief description:
The Apache PDFBox library is an open source Java tool for working with PDF documents. (Current version: 1.8.6)

Functionalities:
  • Extract text
    PDFBox can quickly and accurately extract text from a variety of PDF documents. This functionality is encapsulated in the class org.apache.pdfbox.util.PDFTextStripper.

    Example:

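    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PDFReader {
        public static void main(String[] args) {
            File file = new File("C:/my.pdf");
            PDDocument pdDoc = null;
            try {
                PDFParser parser = new PDFParser(new FileInputStream(file));
                parser.parse();
                // The parser produces the low-level COSDocument that PDDocument wraps.
                COSDocument cosDoc = parser.getDocument();
                pdDoc = new PDDocument(cosDoc);
                PDFTextStripper pdfStripper = new PDFTextStripper();
                String parsedText = pdfStripper.getText(pdDoc);
                System.out.println(parsedText);
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // Release the resources held by the document.
                if (pdDoc != null) {
                    try { pdDoc.close(); } catch (IOException ignored) { }
                }
            }
        }
    }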
  • Extract pages
    A single page or a range of pages can be extracted from a PDF by setting the start and end page on the stripper.

    Example:

       try {
           PDFParser parser = new PDFParser(new FileInputStream(file));
           parser.parse();
           pdfStripper = new PDFTextStripper();
           // Wrap the COSDocument produced by the parser.
           pdDoc = new PDDocument(parser.getDocument());
           // Page numbers are 1-based and the end page is inclusive.
           pdfStripper.setStartPage(1);
           pdfStripper.setEndPage(5);
           String parsedText = pdfStripper.getText(pdDoc);
           System.out.println(parsedText);
       } catch (IOException e) {
           e.printStackTrace();
       }
  • Extract height and width / Extract font / Extract bold and italic
    To extract font-related information, one has to "dig" into the API and override certain methods.
    Help from Stack Overflow
  • Extract tables
    Extracting tabular information is not implemented, so the wheel has to be reinvented.
    Related post on Stack Overflow
  • Extract regions of text / Extract position of characters
    You can extract text by area in PDFBox. The problem is getting the coordinates in the first place. A solution is to extend the normal PDFTextStripper, override processTextPosition(TextPosition text), print out the coordinates of each character, and figure out where in the document they are.

    Example:

    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition( true );
    Rectangle rect = new Rectangle( 500, 100, 55, 5);
    stripper.addRegion( "class1", rect );
    stripper.extractRegions( page );
    String string = stripper.getTextForRegion( "class1" );
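
The idea behind area-based extraction can be tried without PDFBox at all. The sketch below is library-free and rests on assumptions: `Glyph` is a hypothetical stand-in for the per-character coordinates that a `processTextPosition` override would print, and `textInRegion` is an illustrative helper, not a PDFBox method. It simply keeps the characters whose position lies inside a `java.awt.Rectangle`; it does not show how PDFBox implements this internally.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

public class RegionFilterSketch {

    /** Hypothetical stand-in for one positioned character (not a PDFBox type). */
    static final class Glyph {
        final String text;
        final int x;
        final int y;

        Glyph(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates, in input order, the glyphs whose (x, y) lies inside the region. */
    static String textInRegion(List<Glyph> glyphs, Rectangle region) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (region.contains(g.x, g.y)) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("A", 505, 102));
        glyphs.add(new Glyph("B", 520, 103));
        glyphs.add(new Glyph("C", 100, 400)); // lies outside the region
        // Same rectangle as in the PDFTextStripperByArea example.
        Rectangle rect = new Rectangle(500, 100, 55, 5);
        System.out.println(textInRegion(glyphs, rect));
    }
}
```

Once the printed coordinates show where the interesting text sits, the same rectangle can be handed to the stripper's addRegion call.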

Note: The documentation is disorganized and resources are scarce, mainly because PDFBox is a low-level API.

Thursday, July 10, 2014

View in GWT using the UiBinder technique.

This is an implementation of building a View in GWT using the UiBinder technique. In part 1 you can find the basic setup for this project. 

PART 2 

SearchView (Interface)
  
package helloworld.client.view;
import helloworld.client.presenter.SearchPresenter;
import com.google.gwt.user.client.ui.Widget;
public interface SearchView {
 public interface Presenter { }
 void setPresenter(SearchPresenter presenter);
 Widget asWidget();
}

SearchPresenter (Class)

package helloworld.client.presenter;
import helloworld.client.view.SearchView;
import com.google.gwt.user.client.ui.HasWidgets;
public class SearchPresenter implements Presenter, SearchView.Presenter {

    private final SearchView view;

    /** Constructor **/
    public SearchPresenter(SearchView view) {
        this.view = view;
        bind();
    }

    /** Hands this presenter to the view so the view can call back into it. **/
    public final void bind() {
        view.setPresenter(this);
    }

    /**************
     * SET VIEW *
     **************/

    @Override
    public final void go(final HasWidgets container) {
        container.clear();
        container.add(view.asWidget());
    }
}

*The "owner" class is instantiated inside the function onModuleLoad().

SearchViewImpl ("Owner" Class)
  
package helloworld.client.view;

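// Hint: in Eclipse, Ctrl+Shift+O organizes the imports.
import helloworld.client.presenter.SearchPresenter;
import com.google.gwt.core.client.GWT;
import com.google.gwt.uibinder.client.UiBinder;
import com.google.gwt.uibinder.client.UiField;
import com.google.gwt.uibinder.client.UiTemplate;
import com.google.gwt.user.client.ui.Button;
import com.google.gwt.user.client.ui.Composite;
import com.google.gwt.user.client.ui.TextBox;
import com.google.gwt.user.client.ui.Widget;

public class SearchViewImpl extends Composite implements SearchView {

 @UiField TextBox searchBox;
 @UiField Button buttonSubmit;
    
 private static SearchViewUiBinder uiBinder = GWT.create(SearchViewUiBinder.class);

 @UiTemplate("SearchView.ui.xml")
 interface SearchViewUiBinder extends UiBinder<Widget, SearchViewImpl> { }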

 public SearchViewImpl() {
  initWidget(uiBinder.createAndBindUi(this)); 
 }
 
 /**************************
  * Connect with Presenter *
  **************************/

 private SearchPresenter presenter;

 public void setPresenter(SearchPresenter presenter) {
  this.presenter = presenter;
 }
  
}

We can access fields by using ui:field="name_Of_Field_to_manipulate_with_Java". Below is the xml file used by UiBinder in order to render the page of the web application.
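
The template itself is not reproduced here, so what follows is only a minimal sketch of what a SearchView.ui.xml could look like. It assumes nothing beyond the two fields declared in SearchViewImpl (searchBox and buttonSubmit); the HTMLPanel wrapper and the button label are illustrative choices, not the original markup.

```xml
<!DOCTYPE ui:UiBinder SYSTEM "http://dl.google.com/gwt/DTD/xhtml.ent">
<ui:UiBinder xmlns:ui="urn:ui:com.google.gwt.uibinder"
             xmlns:g="urn:import:com.google.gwt.user.client.ui">
  <g:HTMLPanel>
    <!-- Each ui:field value must match an @UiField name in SearchViewImpl. -->
    <g:TextBox ui:field="searchBox"/>
    <g:Button ui:field="buttonSubmit">Search</g:Button>
  </g:HTMLPanel>
</ui:UiBinder>
```

When createAndBindUi(this) runs, the widgets declared with ui:field are assigned to the matching @UiField members of the owner class.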
  
 

 
 

Wednesday, June 4, 2014

GWT UiBinder Technique (Spring Project)

 
    The UiBinder technique is also known as "Declarative Layout". UiBinder provides a declarative way of defining the user interface; in general, it is to GWT widgets what JSP is to Servlets. It is designed to separate the functionality of the UI from its view and to keep the two loosely coupled. In practice, it separates the program logic from the UI. Developers can build GWT applications as HTML pages with GWT widgets configured throughout them. It also makes collaboration with UI designers easier, since they are usually more comfortable with XML, HTML and CSS than with Java source code.

PART 1

The basic setup is taken from a Spring project with GWT integration (for designing the UI).
In src/main/java:

BASIC SETUP:

VIEW(Interface)
package helloworld.client.view;
import com.google.gwt.user.client.ui.IsWidget;
import helloworld.client.presenter.Presenter;
public interface View extends IsWidget {
 void setPresenter(Presenter presenter);
}

PRESENTER (Interface)
package helloworld.client.presenter;

import com.google.gwt.user.client.ui.HasWidgets;

public abstract interface Presenter {
 void go(final HasWidgets container);
}

HELLOWORLD (Entry point class)
package helloworld.client;

import com.google.gwt.core.client.EntryPoint;
import com.google.gwt.core.client.GWT;
import com.google.gwt.user.client.ui.RootPanel;

/**
 * Entry point classes define onModuleLoad().
 */
public class HelloWorld implements EntryPoint {
 @Override
 public void onModuleLoad() {
  AppController appViewer = new AppController();
  appViewer.go(RootPanel.get("gwtContainer"), new HeaderViewImpl());
   }
}
*In HTML page there is a div with this id:
id="gwtContainer"

APP CONTROLLER(Class)
  
    public class AppController implements ValueChangeHandler<String> {
    private HasWidgets container;
    public AppController() {
        bind();
    }

    /**
     * #1 Implementing the history management:
     *    binding events to actions in the code
     * #2 Handling history changes
     * #3 Displaying the search view
     **/

    private void bind() {
        // TODO ..
        History.addValueChangeHandler(this); // #1
    }
    /******** Event & History Management ***************/

    @Override
    public final void onValueChange(ValueChangeEvent<String> event) { // #2
        // The type parameter makes getValue() return the history token as a String.
        String token = event.getValue(); // #2
        if (token != null) {             // #2
            if (token.startsWith("Home")) {
                displaySearch(token);    // #2
            }   
        }
    }

    private void displaySearch(String token) { // #3
        container = RootPanel.get("gwtContainer");
        container.clear();
        SearchPresenter presenter = null;
        SearchView searchView = new SearchViewImpl();
        presenter = new SearchPresenter(searchView);
        presenter.go(container);
    }

    public final void go(final HasWidgets container, Widget widget) {
        this.container = container;
        container.add(widget);

        if ("".equals(History.getToken())) {
            History.newItem("Home");
        } else {
            History.fireCurrentHistoryState();
        }
    }
}
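
The token dispatch in onValueChange can also be tried outside GWT. The following is a minimal plain-Java sketch under stated assumptions: `TokenRouter` is a hypothetical class, not part of GWT, that maps token prefixes to actions in the same spirit as the startsWith("Home") check above. In the real application the registered action would call something like displaySearch(token).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class TokenRouter {
    // Prefix -> action, checked in registration order.
    private final Map<String, Consumer<String>> routes = new LinkedHashMap<>();

    public void register(String prefix, Consumer<String> action) {
        routes.put(prefix, action);
    }

    /** Runs the first action whose prefix matches the token; returns false if none does. */
    public boolean dispatch(String token) {
        if (token == null) {
            return false;
        }
        for (Map.Entry<String, Consumer<String>> route : routes.entrySet()) {
            if (token.startsWith(route.getKey())) {
                route.getValue().accept(token);
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> shown = new ArrayList<>();
        TokenRouter router = new TokenRouter();
        // Stand-in for displaySearch(token): just record what would be shown.
        router.register("Home", token -> shown.add("search:" + token));
        System.out.println(router.dispatch("Home"));
        System.out.println(router.dispatch("Unknown"));
        System.out.println(shown);
    }
}
```

Keeping the routes in a map means a new screen only needs one more register call instead of another nested if in the handler.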
This was the basic setup. In part 2 we implement a View using UiBinder.

Monday, March 17, 2014

Extract Text from PDF [PDFTextStream API]




    To extract textual content from a PDF, it is probably more convenient to use an API like PDFTextStream. The extraction is fairly simple, though not quite a one-line solution. A code example is given below.

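    /** Get Content of PDF **/
        // Note: this configuration is never passed to the stream below,
        // so as written it has no effect on the extraction.
        PDFTextStreamConfig configuration = new PDFTextStreamConfig();
        configuration.setTableDetectionEnabled(true);
        PDFTextStream stream = new PDFTextStream(new File(filePath));
        StringBuilder text = new StringBuilder();
        OutputTarget target = new OutputTarget(text);
        stream.pipe(target);
        // content is assumed to be a String declared elsewhere.
        content = target.getObject().toString();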
    Compared with a similar API like Tika, I have to say that PDFTextStream is faster and gives more useful information. In particular, Tika extracts raw textual information from a PDF, while PDFTextStream gives structured text with correlated information such as character encoding, height and the region of the text. The result is structured as Pages --> Blocks --> Lines --> TextUnits, and below is an example of how to get the pages of a PDF.
 
    /** Get number of pages **/
    int pageCnt = stream.getPageCnt();
    // Page indexes start at 0, so the last valid index is pageCnt - 1.
    for (int i = 0; i < pageCnt; i++) {
        Page page = stream.getPage(i);
        System.out.println("PAGE NR[" + i + "]:\n" + page);
    }
    To get the Blocks of a Page, the following method can be used:
 
    private List<Block> getBlocks(Page page) throws IOException {
        List<Block> listOfBlocks = new ArrayList<Block>();
        // The page's text content is the parent of its top-level blocks.
        int numBlockChilds = page.getTextContent().getChildCnt();

        for (int i = 0; i < numBlockChilds; i++) {
            Block block = page.getTextContent().getChild(i);
            listOfBlocks.add(block);
        }

        return listOfBlocks;
    }

    The final step in order to get raw textual information is to extract the text from the block:

 
    private String block2txt(Block block) throws IOException {
        StringBuilder text = new StringBuilder();
        OutputTarget target = new OutputTarget(text);
        block.pipe(target);
        String block_text = target.getObject().toString();
        return block_text;
    }
    This returns the text of a Block. In the same way, we can navigate the rest of the PDF's content and extract the text of a Page or a Line.
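
The nested walk through Pages --> Blocks --> Lines --> TextUnits can be pictured with a library-free sketch. Everything in it is an assumption made for illustration: `Page`, `Block` and `Line` below are hypothetical stand-ins with plain lists, not the PDFTextStream classes of the same names, and text units are reduced to strings.

```java
import java.util.Arrays;
import java.util.List;

public class HierarchySketch {

    // Hypothetical stand-ins for the structure; these are NOT PDFTextStream classes.
    static final class Line {
        final List<String> textUnits;
        Line(String... textUnits) { this.textUnits = Arrays.asList(textUnits); }
    }

    static final class Block {
        final List<Line> lines;
        Block(Line... lines) { this.lines = Arrays.asList(lines); }
    }

    static final class Page {
        final List<Block> blocks;
        Page(Block... blocks) { this.blocks = Arrays.asList(blocks); }
    }

    /** Joins the text units of every line and ends each Line with a newline. */
    static String pageToText(Page page) {
        StringBuilder sb = new StringBuilder();
        for (Block block : page.blocks) {
            for (Line line : block.lines) {
                for (String unit : line.textUnits) {
                    sb.append(unit);
                }
                sb.append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Page page = new Page(
                new Block(new Line("Hello", " ", "world"), new Line("Bye")),
                new Block(new Line("Second block")));
        System.out.print(pageToText(page));
    }
}
```

Stopping the loops one level earlier gives per-block or per-line text, which is the same choice the real API offers by piping a Page, a Block or a Line to an output target.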