Parse HTML With PHP
Not all websites offer an easy-to-use API for looking up information, but you can parse the HTML of almost any website to extract data. PHP can parse the DOM (Document Object Model) of an HTML page and extract data from it. The DOM extension, introduced in PHP 5, allows manipulation of the HTML/XML DOM.

The loadHTML method of DOMDocument parses the HTML and, unlike the loadXML method, does not require the source string to be well-formed.
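
As a minimal sketch of that tolerance, the snippet below feeds deliberately sloppy markup to loadHTML and reads the list items back with getElementsByTagName. The markup and variable names are made up for illustration, and the libxml_use_internal_errors calls are only there to keep parser warnings out of the output.

```php
<?php
// Hypothetical, deliberately sloppy markup: the <li> and <p> tags are never closed.
$html = '<html><body><h1>Fruit</h1><ul><li>Apple<li>Banana</ul><p>Unclosed paragraph</body></html>';

$doc = new DOMDocument();

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Walk every <li> element the parser recovered from the broken markup.
$items = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $items[] = trim($li->textContent);
}

print_r($items);
```

Passing the same string to loadXML would fail, because the unclosed tags make it invalid XML.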

The getElementsByTagName method of DOMDocument may not always be the best way to extract data. If you need to find a tag with a specific attribute, you can use the DOMXPath object to search the DOM with XPath expressions.
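
The following sketch shows one way to do this, assuming you want only the links that carry a particular class attribute. The markup, the "external" class name, and the URLs are hypothetical.

```php
<?php
// Hypothetical markup: two links have class="external", one does not.
$html = '<div>'
      . '<a class="external" href="http://example.com/a">A</a>'
      . '<a href="/local">Local</a>'
      . '<a class="external" href="http://example.com/b">B</a>'
      . '</div>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// DOMXPath wraps the parsed document and evaluates XPath expressions against it.
$xpath = new DOMXPath($doc);

// Select every <a> element whose class attribute is exactly "external".
$links = [];
foreach ($xpath->query('//a[@class="external"]') as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($links);
```

Doing the same with getElementsByTagName alone would mean looping over every link and checking the attribute by hand.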

Alternatives to Parse HTML

  • XMLReader
    This extension is an XML pull parser. The reader acts like a cursor moving forward through the file and stopping at each node. It is not as robust with HTML as DOMDocument::loadHTML because it does not handle malformed HTML very well.
    Read more about XMLReader
  • SimpleXml
    This extension provides an easy way to convert XML to an object that can be processed with property selectors and iterators. It is a good choice if you know the HTML is valid; otherwise the DOM option is more robust.
    Read more about SimpleXml
  • phpQuery
    This is a CSS3 selector-driven DOM API based on the jQuery library. It is written entirely in PHP 5 and does not require any additional extensions to be installed. If you are familiar with jQuery, this is an excellent HTML parsing library to use.
    Read more about phpQuery
  • SimpleHtmlDom
    This library is not based on libxml and is written entirely in PHP 5. It works well with invalid HTML and has CSS-like selectors for finding HTML nodes.
    Read more about SimpleHtmlDom
  • html5lib
    This library is written in PHP and is based on the HTML5 specification. Some HTML parsers may not work entirely with the newer HTML5 spec; this project aims to provide an API for parsing HTML according to that specification.
    Read more about html5lib
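
To illustrate the trade-off described for SimpleXml above, here is a small sketch using the built-in simplexml_load_string function. It assumes two made-up strings, one well-formed and one not; the third-party libraries in the list are not shown because their APIs are outside the scope of this post.

```php
<?php
// Hypothetical inputs: the first is well-formed, the second leaves <li> unclosed.
$wellFormed = '<ul><li class="fruit">Apple</li><li class="fruit">Banana</li></ul>';
$malformed  = '<ul><li>Apple<li>Banana</ul>';

// Keep libxml parse errors out of the output.
$previous = libxml_use_internal_errors(true);

// Well-formed markup becomes an object whose child elements are properties.
$list  = simplexml_load_string($wellFormed);
$names = [];
foreach ($list->li as $li) {
    $names[] = (string) $li;
}

// Malformed markup cannot be parsed as XML, so the call returns false.
$broken = simplexml_load_string($malformed);

libxml_clear_errors();
libxml_use_internal_errors($previous);

print_r($names);
var_dump($broken);
```

When the input might look like the second string, DOMDocument::loadHTML is the safer choice.
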
Date posted: July 4, 2014

Copyright © 2013, Neil Bittner. All Rights Reserved.