Wednesday, September 18, 2013

Parse HTML with JSOUP API

filotechnologia

  This article is intended for those who want to manipulate or extract data from HTML pages.
For this purpose it uses jsoup, a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

 This library implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

The core functionality of this API is shown in the next few paragraphs, along with some simple examples. In general, jsoup can:
  • Scrape and parse HTML from a URL, file, or string 
Example:

File htmlFile = new File("../workspace/HtmlParserJsoup/GET_RESPONSE.html"); 
Document doc = Jsoup.parse(htmlFile, "UTF-8"); 
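The example above reads from a file. As a minimal sketch of the "string" case, the snippet below parses HTML held in memory and reads the page title. The class name ParseStringDemo and the helper titleOf are made up for illustration; Jsoup.parse(String) and Document.title() are part of jsoup. For the "URL" case, jsoup offers Jsoup.connect(url).get(), which is not shown here because it needs network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringDemo {
    // Hypothetical helper: parses an in-memory HTML string and returns its <title> text.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><p>Hello, jsoup!</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```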

  • Find and extract data, using DOM traversal or CSS selectors
Example:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png
Element masthead = doc.select("div.masthead").first(); // div with class=masthead
Elements resultLinks = doc.select("h3.r > a"); // direct a after h3
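Once elements are selected, the data usually has to be pulled out of them. Here is a small sketch that collects the absolute URL of every link. The class ExtractLinksDemo, the helper absoluteLinks, and the sample markup are assumptions for illustration; select, absUrl, and the two-argument Jsoup.parse(html, baseUri) are regular jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class ExtractLinksDemo {
    // Hypothetical helper: returns the absolute URL of every <a href> in the given HTML.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Elements links = doc.select("a[href]");
        List<String> urls = new ArrayList<>();
        for (Element link : links) {
            // absUrl resolves relative hrefs against the base URI given to parse()
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<div class=masthead><a href='/about'>About</a>"
                + "<a href='http://example.org/x'>External</a></div>";
        for (String url : absoluteLinks(html, "http://example.com/")) {
            System.out.println(url);
        }
    }
}
```

The base URI matters here: without it, absUrl cannot resolve a relative href such as /about and returns an empty string.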

  • Manipulate the HTML elements, attributes, and text
Example:
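Element div = doc.select("div").first(); // <div></div>
div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
div.prepend("<p>First</p>");
div.append("<p>Last</p>");
// now: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

Element span = doc.select("span").first(); // <span>One</span>
span.wrap("<li><a href='http://example.com/'></a></li>");
// now: <li><a href="http://example.com/"><span>One</span></a></li>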
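Attributes and plain text can be changed in the same way. The sketch below is only an illustration: the class EditAttributesDemo, the helper markLinks, and the "external" class name are invented, while attr, addClass, and text are standard Element methods.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EditAttributesDemo {
    // Hypothetical helper: tags every link and rewrites the first <h1>, if there is one.
    static Document markLinks(String html) {
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("a[href]")) {
            link.attr("rel", "nofollow"); // set (or overwrite) an attribute
            link.addClass("external");    // add a CSS class name
        }
        Element heading = doc.select("h1").first();
        if (heading != null) {
            heading.text("Updated title"); // replace the element's text content
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = markLinks("<h1>Old title</h1><a href='/x'>a link</a>");
        System.out.println(doc.body().html());
    }
}
```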

  •  Clean user-submitted content against a safe white-list, to prevent XSS attacks
Example:
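String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>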
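A self-contained version of the cleaning step might look like the sketch below. Note one assumption: as far as I know, recent jsoup releases (1.14.2 onward) renamed Whitelist to Safelist and later removed the old class, so this sketch uses Safelist; on an older jsoup, swap in Whitelist. The class CleanDemo and the helper sanitize are made-up names.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanDemo {
    // Hypothetical helper: keeps only the basic text markup allowed by Safelist.basic().
    // Assumes a jsoup version that ships org.jsoup.safety.Safelist.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
                + "<script>alert(1)</script>";
        // The onclick handler and the script element should be gone from the result
        System.out.println(sanitize(unsafe));
    }
}
```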

  • Output tidy HTML
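Example (a minimal sketch, not taken from the jsoup docs; the class TidyOutputDemo and the helper tidy are invented names, and the indent of 2 is an arbitrary choice). It parses sloppy markup with unclosed tags and prints it back with pretty-printing enabled through Document.outputSettings():

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TidyOutputDemo {
    // Hypothetical helper: parses messy HTML and returns a normalized, indented document.
    static String tidy(String messyHtml) {
        Document doc = Jsoup.parse(messyHtml);
        // Pretty-printing is on by default; it is set explicitly here to make the intent clear
        doc.outputSettings().prettyPrint(true).indentAmount(2);
        return doc.html();
    }

    public static void main(String[] args) {
        // The <p> tags are never closed; the parser closes them while building the DOM
        System.out.println(tidy("<div><p>One<p>Two</div>"));
    }
}
```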


  jsoup is an open source project distributed under the liberal MIT license. The source code is available on GitHub. You can download the latest version of the .jar here.
An example of HTML parsing can be found here.
Enjoy!   :-)

