- HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.
- XPath is a language for addressing parts of an XML document, designed to be used by both XSLT and XPointer.
- XPath Helper makes it easy to extract, edit, and evaluate XPath queries on any webpage.
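To make "addressing parts of an XML document" concrete, here is a minimal sketch that evaluates an XPath expression with the XPath support bundled in the JDK (`javax.xml.xpath`). The class name `XPathDemo`, the helper `select` and the sample XML are made up for illustration; nothing here depends on HtmlCleaner or XPath Helper.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class XPathDemo {

    /** Parses well-formed XML and returns the text of every node the expression selects. */
    static List<String> select(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Ask for a node set so that zero, one or many matches are all handled the same way.
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, just to have something to query.
        String xml = "<books>"
                + "<book lang=\"en\"><title>Dune</title></book>"
                + "<book lang=\"es\"><title>Rayuela</title></book>"
                + "</books>";

        // Select the title of every book whose lang attribute is "es".
        System.out.println(select(xml, "/books/book[@lang='es']/title"));
    }
}
```

The expression reads like a path: start at `books`, step into each `book`, keep only those matching the `[@lang='es']` predicate, then take their `title`.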
How to use them together.
One way to use them is to grab information from a page in a form that you can process with your own programs. Please check that this complies with the website's terms and conditions ;). Here is how:
- Start with any page whose content you want. Right from your browser, use XPath Helper to point to the element you want and you'll get the expression you need. Tweak it a little to grab all the content you need and to leave out unnecessary parts.
- Use HtmlCleaner to programmatically fetch that URL, then use the evaluateXPath method to get what you want out of the page.
- Do whatever you need with the data and have fun :)
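The steps above could look roughly like the following sketch. It assumes the HtmlCleaner jar is on the classpath and uses `HtmlCleaner.clean` and `TagNode.evaluateXPath`; the class name `PageScraper`, the helper `extractText`, the URL and the `//h1` expression are placeholders for illustration, so swap in the page you care about and the expression XPath Helper gave you. Keep in mind that HtmlCleaner ships its own XPath evaluator, which may not accept every expression a browser does, so a tweaked expression is worth testing.

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class PageScraper {

    /** Evaluates an XPath expression against a cleaned document and returns the matched text. */
    static List<String> extractText(TagNode root, String xpath) throws XPatherException {
        List<String> results = new ArrayList<>();
        // evaluateXPath returns Object[]: elements come back as TagNode,
        // while attribute or text() matches come back as string-like objects.
        for (Object match : root.evaluateXPath(xpath)) {
            if (match instanceof TagNode) {
                results.add(((TagNode) match).getText().toString().trim());
            } else {
                results.add(match.toString().trim());
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException, XPatherException {
        // Placeholder URL and expression; replace them with your own.
        String pageUrl = "https://example.com/";
        String expression = "//h1";

        // Fetch the page and let HtmlCleaner turn it into a well-formed tree.
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(URI.create(pageUrl).toURL());

        for (String text : extractText(root, expression)) {
            System.out.println(text);
        }
    }
}
```

Keeping the extraction in its own method means you can also feed it a document cleaned from a string or a saved file, which is handy for trying out expressions without hitting the site again.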
As a bonus, HtmlCleaner is available in the main Maven repository, so just include it in your project and go.
They worked great for me, and I hope they do for you.
1 comment:
Htmlcleaner, XPath and XPath Helper are excellent tools; I use Mozilla Custom Browser Helper Objects for excellent performance and functionality.