68

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate.

Does such a library exist, or am I better off just trying to use regular expressions?

Charles Stewart
Sophie Alpert
  • I like Ben Reeves' lightweight wrapper, which he mentioned in this thread. The wrapper has moved to GitHub: [Objective-C-HMTL-Parser](https://github.com/zootreeves/Objective-C-HMTL-Parser) – yarchiko Jul 30 '12 at 09:06
  • 1
    How is this question "not constructive"? – 735Tesla Mar 29 '14 at 12:41

9 Answers

89

I found hpple quite useful for parsing messy HTML. The hpple project is an Objective-C wrapper around the XPathQuery library for parsing HTML. With it you can send an XPath query and receive the matching elements.

Requirements:

-Add libxml2 includes to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Header Search Paths"
  3. Add a new search path "${SDKROOT}/usr/include/libxml2"
  4. Enable recursive option

-Add libxml2 library to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Other Linker Flags"
  3. Add a new search flag "-lxml2"

-From hpple get the following source code files and add them to your project:

  1. TFHpple.h
  2. TFHpple.m
  3. TFppleElement.h
  4. TFppleElement.m
  5. XPathQuery.h
  6. XPathQuery.m

-Work through the W3Schools XPath tutorial to get comfortable with the XPath language.
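
For projects that keep build settings in a configuration file rather than in the settings editor, the two libxml2 settings above could be written roughly as follows. This is only a sketch: it assumes an .xcconfig file attached to the target, and the `$(inherited)` prefix is an addition not mentioned in the steps, included so that any existing values are preserved. The recursive option from step 4 is not shown.

```
// Let #include <libxml/...> resolve against the SDK copy of libxml2
HEADER_SEARCH_PATHS = $(inherited) $(SDKROOT)/usr/include/libxml2

// Link against libxml2
OTHER_LDFLAGS = $(inherited) -lxml2
```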

Code Example

#import "TFHpple.h"

NSData *data = [[NSData alloc] initWithContentsOfFile:@"example.html"];

// Create the parser
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];

// Get all the cells of the 2nd row of the 3rd table
NSArray *elements = [xpathParser searchWithXPathQuery:@"//table[3]/tr[2]/td"];

// Access the first cell, but only if the query matched anything
if ([elements count] > 0) {
    TFHppleElement *element = [elements objectAtIndex:0];

    // Get the text within the cell tag
    NSString *content = [element content];
    NSLog(@"%@", content);
}

// Manual reference counting; omit these two lines under ARC
[xpathParser release];
[data release];

Known issues

Because hpple is a wrapper over XPathQuery, which is itself a wrapper over libxml2, this option is probably not the most efficient. If performance is an issue in your project, I recommend writing your own lightweight solution based on the hpple and XPathQuery code.

ldiqual
Albaregar
  • 2
    I used this just now, and it worked very well so far. – Karsten Silz Mar 11 '10 at 07:57
  • It is working very fine with the String Data. Can you please tell me how can I get and show an Image from the html ? – Akshay Jul 12 '11 at 06:37
  • Akshay, images are not stored in the HTML. You must get the URL and download the image yourself. You could use [NSData dataWithContentsOfURL:] to get the file once you have the URL. – Maciej Swic Jul 25 '11 at 13:11
  • 1
    Since Jul 8, 2011 the method `search:` of `TFHpple` was renamed to `searchWithXPathQuery:` See [https://github.com/topfunky/hpple/commit/fd5ec102a55ce08f68c6f2060acfcdfb2d3a13a3](https://github.com/topfunky/hpple/commit/fd5ec102a55ce08f68c6f2060acfcdfb2d3a13a3) – Protocole May 06 '12 at 14:57
  • This worked very well for me, thank you. I do have a strange quirk where file names seem to have a space character prepended to them but this may be occurring due to a coding bug and have nothing to do with hpple. – Robert Nov 05 '12 at 15:10
  • Can you add/remove elements using Hpple? – Valerio Santinelli Feb 13 '13 at 16:53
49

Looks like libxml2.2 comes in the SDK, and libxml/HTMLparser.h claims the following:

This module implements an HTML 4.0 non-verifying parser with API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view.

That sounds like what I need, so I'm probably going to use that.

Sophie Alpert
19

Just in case anyone has got here by googling for a nice XPath parser and gone off and used TFHpple, note that TFHpple uses XPathQuery. This is pretty good, but has a memory leak.

In the function PerformXPathQuery, if the node set is found to be nil, the function returns before cleaning up.

So find the bit of code below and add the two cleanup lines before the return:

  xmlNodeSetPtr nodes = xpathObj->nodesetval;
  if (!nodes)
  {
    NSLog(@"Nodes was nil.");
    /* Cleanup */
    xmlXPathFreeObject(xpathObj);
    xmlXPathFreeContext(xpathCtx);
    return nil;
  }

If you are doing a LOT of parsing, it's a vicious leak. Now.... how do I get my night back :-)

DavidAWalsh
12

I wrote a lightweight wrapper around libxml which may be useful:

Objective-C-HMTL-Parser

forsvarir
  • Looks great Ben. I may be using it in my upcoming iPad application. – Brock Woolf Aug 12 '10 at 08:21
  • 2
    Site is down, you should post this on GitHub! – bentford Apr 09 '12 at 22:27
  • Ben, I tried to add your library - is it for iphone development as well? since I get http://stackoverflow.com/questions/14086354/adding-htmlparser-library-undefined-symbols-for-architecture-armv7s – Dejell Dec 29 '12 at 22:11
5

This probably depends on how messy the HTML is and what you want to extract. But usually Tidy does quite a good job. It is written in C and I guess you should be able to build and statically link it for the iPhone. You can easily install the command line version and test the results first.

tcurdt
5

You may want to check out ElementParser. It provides "just enough" parsing of HTML and XML. Nice interfaces make walking around XML / HTML documents very straightforward. http://touchtank.wordpress.com/

4

How about using the WebKit component, possibly with third-party packages such as jQuery, for tasks like these? Wouldn't it be possible to load the HTML in an invisible component and take advantage of the very mature selectors of the JavaScript frameworks?

tore
3

Google's GData Objective-C API reimplements NSXMLElement and other related classes that Apple removed from the iPhone SDK. You can find it here: http://code.google.com/p/gdata-objectivec-client/. I've used it for handling messaging via Jabber. Of course, if your HTML is malformed (missing closing tags), this might not help much.

dnolen
3

We use Convertigo to parse HTML on the server side and expose the result to our mobile apps as clean JSON web services.

Wulkanman