Regular Expression doesn't match

Question

I've got a string with very unclean HTML. Before I parse it, I want to convert this:

<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>

into NE DEK 143, so that it is a bit easier to parse. I've got this regular expression (RegexKitLite):

NSString *str = [dataString stringByReplacingOccurrencesOfRegex:@"<TABLE><TR><TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<\\/TR><\\/TABLE>" 
                                                     withString:@"$1 $3 $5"];

I'm not an expert in regex. Can someone help me out here?

Regards, dodo

Regex experts say don't parse html with regex http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Use an html parser instead. — Amarghosh, May 03 '10 at 09:29
I use an HTML parser, but if I filter this out, it becomes much easier to use the HTML parser... — dododedodonl, May 03 '10 at 11:36

Delan Azabani · Accepted Answer · 2010-05-03T11:14:46.293

1

Amarghosh and bobince (who wrote the winning answer to the linked question) are generally right about this. However, since you are just sanitising, regexps are actually fine.

First, strip the tags:

s/<.*?>//g

Then collapse all extra spaces into one:

s/\s+/ /g

Then remove leading/trailing space:

s/^\s+|\s+$//g

Then get the values:

^([^ ]+) ([^ ]+) ([^ ]+)$
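
The four steps above can be sketched end to end. This is a Python rendition rather than RegexKitLite code, so the function name `extract_values` and the variable names are illustrative assumptions, not part of the answer. Python's `re.sub` replaces every occurrence by default, which corresponds to a global (`g`) substitution.

```python
import re

# Sample input copied from the question.
html = """<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>"""


def extract_values(markup):
    """Hypothetical helper: apply the four steps and return the three values."""
    text = re.sub(r"<.*?>", "", markup)    # 1. strip the tags
    text = re.sub(r"\s+", " ", text)       # 2. collapse runs of whitespace
    text = re.sub(r"^\s+|\s+$", "", text)  # 3. trim leading/trailing space
    match = re.match(r"^([^ ]+) ([^ ]+) ([^ ]+)$", text)  # 4. get the values
    return match.groups() if match else None


values = extract_values(html)
print(values)
```

The function returns `None` when the cleaned text does not consist of exactly three space-separated values, so a caller can detect input that does not have the expected shape.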

edited May 03 '10 at 11:14

answered May 03 '10 at 09:39

No it won't; .*? is not greedy. – Delan Azabani May 03 '10 at 11:13
What about a tag with an embedded newline? Or something like this: `` – Dan May 03 '10 at 11:48
As most of you know, regexps are /bad/ for parsing markup. My answer was just there because OP asked for a regexp method to extract a few data pieces. Again, regexps are /bad/ because they don't catch edge cases, like you pointed out. – Delan Azabani May 03 '10 at 11:49

score 0 · Answer 2 · answered May 03 '10 at 09:49

I have a few suspicions about why your regex might fail (without knowing the rules for string escaping in the iPhone SDK): the dot . is used in places where it would have to match newlines, the slashes look like they are escaped unnecessarily, and so on.

But in your example, the text you're trying to extract is characterized by sitting on lines of its own that contain no tags.

So a search for all occurrences of (?m)^[^<>\r\n]+$ should find all matches.
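
Both points in this answer can be checked with a small experiment. The sketch below uses Python's `re` module, where the dot does not match a newline unless `re.DOTALL` is set; the answer only suspects that RegexKitLite behaves the same way, so treat that as an assumption. The shortened `pattern` and the variable names are illustrative. The line-based search gives the character class a `+` quantifier so that lines longer than one character match.

```python
import re

# Sample input copied from the question.
html = """<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>"""

# Shortened stand-in for the question's pattern (an assumption for brevity).
pattern = r"<font[^>]*>(.+?)</font>"

# Without DOTALL the lazy group cannot cross the line breaks around each value.
without_dotall = re.findall(pattern, html)

# With DOTALL the dot also matches newlines, so the group can span them.
with_dotall = re.findall(pattern, html, flags=re.DOTALL)

# Line-based search: keep whole lines that contain no angle brackets.
plain_lines = re.findall(r"(?m)^[^<>\r\n]+$", html)

print(without_dotall)
print([s.strip() for s in with_dotall])
print(plain_lines)
```

The line-based search relies on each value being alone on its line, which holds for the sample in the question but may not hold for other input.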

score 0 · Answer 3 · answered May 03 '10 at 10:46

If you are sure of the structure of your HTML, you can just extract the text enclosed by the font tags:

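// C# example (requires: using System.Text.RegularExpressions;)
// txt holds the HTML; result is a string that accumulates the extracted text.
Regex r = new Regex(@"<\s*font((\s+[^<>]*)|(\s*))>(?<desiredText>[^<>]*)<\s*/\s*font\s*>");
foreach (Match m in r.Matches(txt))
    result += m.Groups["desiredText"].Value.Trim();

The result is the text enclosed by the font tags, with leading and trailing whitespace trimmed.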
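
For comparison, the same font-tag extraction can be sketched in Python. This is a minimal sketch, not the answerer's code: the names `FONT_TEXT` and `font_texts` are hypothetical, and Python spells a named group as `(?P<name>...)` instead of `(?<name>...)`. It returns a list instead of concatenating, so the values stay separate.

```python
import re

# Same pattern as the C# example, with Python's named-group syntax.
FONT_TEXT = re.compile(
    r"<\s*font((\s+[^<>]*)|(\s*))>(?P<desired_text>[^<>]*)<\s*/\s*font\s*>"
)


def font_texts(markup):
    """Hypothetical helper: return the trimmed text inside each font element."""
    return [m.group("desired_text").strip() for m in FONT_TEXT.finditer(markup)]


# Shortened sample in the shape of the question's input.
sample = """<TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>"""

texts = font_texts(sample)
print(texts)
```

Because `[^<>]*` is a negated character class, it matches newlines as well, which is why this pattern works on the multi-line input without any dot-matches-newline option. The match is case-sensitive, so it only finds lowercase `font` tags as they appear in the question.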
