I've got a string with very unclean HTML. Before I parse it, I want to convert this:

<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>

into NE DEK 143 so it is a bit easier to parse. I've got this regular expression (RegexKitLite):

NSString *str = [dataString stringByReplacingOccurrencesOfRegex:@"<TABLE><TR><TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<\\/TR><\\/TABLE>" 
                                                     withString:@"$1 $3 $5"];

I'm not an expert in regex. Can someone help me out here?

Regards, dodo

dododedodonl
  • Regex experts say don't parse html with regex http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Use an html parser instead. – Amarghosh May 03 '10 at 09:29
  • I use an HTML parser, but if I filter this out, it becomes much easier to use the HTML parser... – dododedodonl May 03 '10 at 11:36

3 Answers

Amarghosh and bobince (who wrote the winning answer to the linked question) are generally right about this. However, since you are only sanitising the input, regexps are actually just fine.

First, strip the tags:

s/<.*?>//g

Then collapse all extra spaces into one:

s/\s+/ /g

Then remove leading/trailing space:

s/^\s+|\s+$//g

Then get the values:

^([^ ]+) ([^ ]+) ([^ ]+)$
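
The four steps above can be sketched in Python roughly as follows. This is only an illustration, not RegexKitLite code: the `clean` helper and the `html` sample are made up for the example, and the tag pattern is a swapped-in variant, `<[^>]*>` rather than `<.*?>`, so that a tag containing a newline is still removed.

```python
import re

# Sample input copied from the question.
html = (
    '<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'NE\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'DEK\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    '143\n'
    '</font> </TD>\n'
    '</TR></TABLE>'
)


def clean(text):
    """Hypothetical helper: strip tags, collapse whitespace, trim the ends."""
    text = re.sub(r"<[^>]*>", "", text)  # step 1: remove every tag
    text = re.sub(r"\s+", " ", text)     # step 2: collapse whitespace runs
    return text.strip()                  # step 3: drop leading/trailing space


cleaned = clean(html)

# Step 4: pull out the three space-separated values.
match = re.match(r"^([^ ]+) ([^ ]+) ([^ ]+)$", cleaned)
values = match.groups() if match else ()
```

Note that `re.sub` replaces every occurrence by default, which corresponds to a global substitution in the `s/.../.../` notation.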
Delan Azabani
  • No it won't; .*? is not greedy. – Delan Azabani May 03 '10 at 11:13
  • What about a tag with an embedded newline? Or something like this: `` – Dan May 03 '10 at 11:48
  • As most of you know, regexps are /bad/ for parsing markup. My answer was just there because OP asked for a regexp method to extract a few data pieces. Again, regexps are /bad/ because they don't catch edge cases, like you pointed out. – Delan Azabani May 03 '10 at 11:49

I have a few suspicions about why your regex might fail (without knowing the rules for string escaping in the iPhone SDK): the dot . is used in places where it would have to match newlines, the slash looks like it's escaped unnecessarily, etc.

But in your example, the text you're trying to extract is characterized by sitting on lines of its own, with no tags on them.

So a search for all occurrences of (?m)^[^<>\r\n]+$ should find all matches.
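
A minimal sketch of this line-based idea in Python, assuming the character class carries a `+` quantifier so that it matches whole lines rather than single characters. The `html` sample is copied from the question; the variable names are only for illustration.

```python
import re

# Sample input copied from the question.
html = (
    '<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'NE\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'DEK\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    '143\n'
    '</font> </TD>\n'
    '</TR></TABLE>'
)

# (?m) makes ^ and $ match at every line boundary; the character class
# rejects any line that contains an angle bracket.
pattern = re.compile(r"(?m)^[^<>\r\n]+$")

values = pattern.findall(html)
joined = " ".join(values)
```

This only works while each value sits on a line of its own; a value sharing a line with a tag would be skipped.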

Tim Pietzcker

If you are sure of your HTML hierarchy, then you can just extract the text enclosed by the font tags:

// C# example
Regex r = new Regex(@"<\s*font((\s+[^<>]*)|(\s*))>(?<desiredText>[^<>]*)<\s*/\s*font\s*>");
foreach (Match m in r.Matches(txt))
    result += m.Groups["desiredText"].Value.Trim();

The result is the text enclosed by the font tags, with leading and trailing whitespace trimmed.
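
For comparison, roughly the same font-tag extraction in Python. This is a sketch under a few assumptions: Python spells named groups `(?P<name>...)`, the `html` sample is copied from the question, and the values are joined with a single space (the loop above concatenates them without a separator) to get the `NE DEK 143` form the question asks for.

```python
import re

# Sample input copied from the question.
html = (
    '<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'NE\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'DEK\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    '143\n'
    '</font> </TD>\n'
    '</TR></TABLE>'
)

# Opening font tag (with or without attributes), tag-free text, closing tag.
FONT_RE = re.compile(
    r"<\s*font(?:\s+[^<>]*|\s*)>(?P<desiredText>[^<>]*)<\s*/\s*font\s*>"
)

# Trim the whitespace (including newlines) around each captured value.
values = [m.group("desiredText").strip() for m in FONT_RE.finditer(html)]
result = " ".join(values)
```

The pattern is case-sensitive as written; passing `re.IGNORECASE` would be needed if the input also used `<FONT>`.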

chapluck