Wednesday, September 18, 2013

Extraction of links from an HTML document + Example (Jsoup 1.7.2)


    This is an example of how to parse an HTML document and extract all the (valid) links contained in the BODY element. The tool used for the parsing and extraction is Jsoup (version 1.7.2).

    The power of this API lies in how simple it is to use: the information we need can be extracted in just a few lines of code. First, an HTML file is read and parsed.

File htmlFile = new File("../workspace/HtmlParserJsoup/sample.html");
Document doc = Jsoup.parse(htmlFile, "UTF-8");
    As long as the file can be read, the output is a Document containing (at least) a head and a body element. Note that Jsoup.parse(File, String) throws an IOException, so the call has to handle or declare it. Also watch out for the import: the class used here is org.jsoup.nodes.Document (import org.jsoup.nodes.Document;), not org.w3c.dom.Document.

    In order to extract the http links contained in the body element, we first find all the elements with the a tag and then loop over them to check the validity of each link. Both the logic and the flow of the program are pretty straightforward.

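/*** GET ALL LINKS OF BODY ***/
Element body = doc.body();                     // BODY element of the parsed document
Elements links = body.getElementsByTag("a");   // every <a> element inside BODY
for (Element link : links)
{
   String linkHref = link.attr("href");        // empty string if the attribute is missing
   String linkText = link.text();              // visible anchor text (not used below)
   // keep only hrefs that are long enough and contain "http"
   if((linkHref.length()>5)&&(linkHref.contains("http")))
      System.out.println("\n[DEBUG] linkHref : " + linkHref);
}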
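At the end, only the links whose href contains "http" are printed. Keep in mind that this check is a simple heuristic: relative links are skipped, and any href that merely contains the text "http" somewhere passes.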
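
The check used in the loop (length greater than 5 and contains "http") is fairly loose. A stricter variant could be sketched with the standard java.net.URI class, as shown here. This is only a sketch under stated assumptions: HttpLinkCheck, isHttpLink and keepHttpLinks are hypothetical names that are not part of the original example or of Jsoup, and treating only absolute http/https URLs with a host as "valid" is an assumption about what valid means here.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Jsoup): a stricter test for http/https links.
final class HttpLinkCheck {

    // True only if href parses as an absolute URI with an http or https
    // scheme and a host. This definition of "valid" is an assumption.
    static boolean isHttpLink(String href) {
        if (href == null) {
            return false;
        }
        try {
            URI uri = new URI(href.trim());
            String scheme = uri.getScheme();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // malformed href, e.g. one containing spaces
        }
    }

    // Returns the hrefs that pass isHttpLink, in their original order.
    static List<String> keepHttpLinks(List<String> hrefs) {
        List<String> kept = new ArrayList<>();
        for (String href : hrefs) {
            if (isHttpLink(href)) {
                kept.add(href);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample hrefs made up for illustration only.
        List<String> hrefs = Arrays.asList(
                "http://example.com/page",
                "https://example.com",
                "/docs/http-guide.html",      // contains "http" but is a relative link
                "mailto:someone@example.com");
        for (String href : keepHttpLinks(hrefs)) {
            System.out.println("[DEBUG] linkHref : " + href);
        }
    }
}
```

Inside the loop over the a elements, the condition could then be replaced by HttpLinkCheck.isHttpLink(linkHref).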

 
