0

I've been pulling my hair out trying to come up with a regex that will pull the first and last name from the following HTML. My regex-fu is not strong.

<span id="label_85110"><b>First Name</b></span>
<br/>
    <span id="value_85110">AWeber- Email Parser</span>
    <br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
    <span id="value_86004">Submission</span>
    <br/>
</p>
<p>
<span id="label_85111"><b>Email</b></span>
<br/>
    <span id="value_85111">leslie@dakno.com</span>
    <br/>
</p>
<p>
<span id="label_85540"><b>Phone</b></span>
<br/>
    <span id="value_85540">919-923-7017</span>
    <br/>
</p>
OMG Ponies
oliver1

3 Answers

3

@oliver1,

Please note that the keyword in Regular Expression is "Regular." Regular Expressions are used with Regular Languages.

Unfortunately, (X)HTML is not a Regular Language. Rather, it is a Context Free Language.

You cannot write a RegEx which can properly parse a Context Free Language in general. This is a mathematically proven result: a regular expression cannot keep track of arbitrarily deep nesting, which is exactly what matching opening tags to their closing tags requires.
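To make the limitation concrete, here is a small Python sketch. It is not from the original answer, and the nested `<div>` sample is made up for illustration. A non-greedy pattern pairs the first opening tag with the first closing tag because it has no way to count nesting depth, while an XML parser tracks the nesting for you.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample with one level of nesting (not from the question).
doc = "<div>outer <div>inner</div> tail</div>"

# The pattern cannot count how many <div> tags are open, so it stops at
# the first </div> it sees and cuts the outer element short.
regex_result = re.search(r"<div>(.*?)</div>", doc).group(1)

# A parser builds the tree, so the outer element's full text is available.
parser_result = "".join(ET.fromstring(doc).itertext())

print(repr(regex_result))
print(repr(parser_result))
```

Making the pattern greedy only moves the problem: it then over-matches as soon as two sibling elements appear.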

The Solution: Use XPath

Instead you should use an XML parser; you are already using XHTML, which means you could use XPath (although you're missing a <p> at the beginning of your code snippet).

How can any parser, RegEx or query identify the first names and last names? The best I see is "<span> elements which come after a <br />" which is pretty weak.

You can nonetheless write an XPath query to find "<span> elements which come after a <br />".

//br/following-sibling::span/text()

... but that also finds the values of Email and Phone, so you'll want only the first two results.
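A rough Python equivalent of that query, assuming the snippet is first made well-formed: the wrapping `<root>` element and the opening `<p>` are additions, and `spans_after_br` is a hypothetical helper. The standard library's `xml.etree.ElementTree` does not support the `following-sibling` axis, so this sketch walks the siblings by hand; a full XPath 1.0 engine such as lxml would accept the query as written.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a <root> element and with the missing
# opening <p> restored so that it parses as XML (both are assumptions).
SNIPPET = """<root>
<p><span id="label_85110"><b>First Name</b></span><br/>
<span id="value_85110">AWeber- Email Parser</span><br/></p>
<p><span id="label_86004"><b>Last Name</b></span><br/>
<span id="value_86004">Submission</span><br/></p>
<p><span id="label_85111"><b>Email</b></span><br/>
<span id="value_85111">leslie@dakno.com</span><br/></p>
<p><span id="label_85540"><b>Phone</b></span><br/>
<span id="value_85540">919-923-7017</span><br/></p>
</root>"""


def spans_after_br(root):
    """Collect the text of every <span> that has an earlier <br/> sibling."""
    found = []
    for parent in root.iter():
        seen_br = False
        for child in parent:
            if child.tag == "br":
                seen_br = True
            elif child.tag == "span" and seen_br:
                found.append(child.text)
    return found


values = spans_after_br(ET.fromstring(SNIPPET))
first_name, last_name = values[:2]  # the remaining entries are Email and Phone
print(first_name, last_name)
```

Relying on position like this breaks as soon as the field order changes, which is why the id-based lookup is the safer choice.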

Alternately, you could instead use the id attributes on the <span> elements:

//span[@id='value_85110']/text()|//span[@id='value_86004']/text()
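As a minimal sketch, the id-based lookup can be run with Python's standard `xml.etree.ElementTree`, which supports this kind of attribute predicate. The wrapping `<root>` element and the restored opening `<p>` are assumptions needed to make the snippet well-formed, and `value_by_id` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Question markup made well-formed: <root> wrapper and first <p> are added.
SNIPPET = """<root>
<p><span id="label_85110"><b>First Name</b></span><br/>
<span id="value_85110">AWeber- Email Parser</span><br/></p>
<p><span id="label_86004"><b>Last Name</b></span><br/>
<span id="value_86004">Submission</span><br/></p>
</root>"""


def value_by_id(root, span_id):
    """Return the text of the <span> with the given id, or None if absent."""
    return root.findtext(f".//span[@id='{span_id}']")


root = ET.fromstring(SNIPPET)
first_name = value_by_id(root, "value_85110")
last_name = value_by_id(root, "value_86004")
print(first_name, last_name)
```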

If You Can Modify The HTML

Ideally, my suggestion is to make your XHTML more semantic:

<label for="first-name-1">First Name</label>
<span id="first-name-1" class="first-name">AWeber- Email Parser</span>
<label for="last-name-1">Last Name</label>
<span id="last-name-1" class="last-name">Submission</span>
<label for="email-address-1">Email</label>
<span id="email-address-1" class="email-address">leslie@dakno.com</span>
<label for="phone-number-1">Phone</label>
<span id="phone-number-1" class="phone-number">919-923-7017</span>

Enhance it with CSS (instead of using <b> and <br/> all over the place)...

label {
    font-weight:bolder;
    display:block;
    margin-top:5px;
}
span {
    display:block;
    margin-bottom:5px;
}

... and then use an XPath query like so:

//span[@class='first-name'] | //span[@class='last-name']
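A short sketch of the class-based lookup in Python, assuming the semantic markup above is wrapped in a `<root>` element (an addition, so that it parses as a single XML document). The standard library's `ElementTree` has no `|` union operator, so each class is queried on its own here.

```python
import xml.etree.ElementTree as ET

# Semantic markup from the answer; the <root> wrapper is an assumption.
MARKUP = """<root>
<label for="first-name-1">First Name</label>
<span id="first-name-1" class="first-name">AWeber- Email Parser</span>
<label for="last-name-1">Last Name</label>
<span id="last-name-1" class="last-name">Submission</span>
</root>"""

root = ET.fromstring(MARKUP)

# One lookup per class, keyed by the class name.
names = {
    cls: root.findtext(f".//span[@class='{cls}']")
    for cls in ("first-name", "last-name")
}
print(names)
```

Note that `@class='first-name'` is an exact string comparison, so it would stop matching if a second class were added to the element.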
Richard JP Le Guen
  • Why do you assume that the poster can influence the generation of the HTML? If he could, he would not need to parse it in the first place... He could just do a normal DB query then... – maxschlepzig Aug 31 '10 at 21:24
  • @maxschlepzig - Edited to emphasize "Use XPath" as opposed to "Fix the HTML" – Richard JP Le Guen Sep 01 '10 at 03:17
0

Disclaimer: This is just an answer to the problem, not an endorsement of using regex for this purpose.

<span[^>]*?><b>First Name(?:<[^>]+?>|\s)+([^<]*?)(?:<[^>]+?>|\s)+?Last Name(?:<[^>]+?>|\s)+([^<]*)[\S\s]+?Phone[\S\s]+?<\/p>

Then just grab groups 1 and 2 for each match. Tested this with Firefox's JavaScript flavor of regex.
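For illustration, here is the same pattern run under Python's `re` module instead of JavaScript (the answer only reports testing it in Firefox). The sample text is the question's HTML as posted, and the variable names are assumptions.

```python
import re

# The HTML exactly as posted in the question (no opening <p>).
HTML = """<span id="label_85110"><b>First Name</b></span>
<br/>
    <span id="value_85110">AWeber- Email Parser</span>
    <br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
    <span id="value_86004">Submission</span>
    <br/>
</p>
<p>
<span id="label_85111"><b>Email</b></span>
<br/>
    <span id="value_85111">leslie@dakno.com</span>
    <br/>
</p>
<p>
<span id="label_85540"><b>Phone</b></span>
<br/>
    <span id="value_85540">919-923-7017</span>
    <br/>
</p>"""

# The answer's pattern, split over three string literals for readability.
PATTERN = re.compile(
    r"<span[^>]*?><b>First Name(?:<[^>]+?>|\s)+([^<]*?)"
    r"(?:<[^>]+?>|\s)+?Last Name(?:<[^>]+?>|\s)+([^<]*)"
    r"[\S\s]+?Phone[\S\s]+?<\/p>"
)

# Group 1 is the first name, group 2 the last name, one pair per record.
names = [(m.group(1), m.group(2)) for m in PATTERN.finditer(HTML)]
print(names)
```

The pattern depends on the fields appearing in exactly this order, so any change to the template would need a matching change to the regex.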

From a philosophical standpoint XPath is probably a more robust solution if you have an XPath-capable HTML parser or if you are sure that you are working with valid XML, which what you posted is not (missing a document root node and an opening <p> tag at the beginning).

BCG
-1

Depends a little bit on the syntax of your actual regex library or tool, but basically use something like this:

<span id="value_85110">([^<]+)</span>

Then you can access the first match group via some API.

Extract the last name in the same way.
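A minimal sketch of the "access the first match group via some API" step, using Python's `re` module as one possible API. It anchors on the `value_…` span ids from the question, since those spans hold the data; `extract_value` is a hypothetical helper, and the approach only holds while the markup keeps exactly this shape.

```python
import re

# Only the two value spans from the question, for brevity.
HTML = """<span id="value_85110">AWeber- Email Parser</span>
<span id="value_86004">Submission</span>"""


def extract_value(html, span_id):
    """Return the text inside <span id="span_id">, or None if not found."""
    pattern = r'<span id="' + re.escape(span_id) + r'">([^<]+)</span>'
    match = re.search(pattern, html)
    return match.group(1) if match else None


first_name = extract_value(HTML, "value_85110")
last_name = extract_value(HTML, "value_86004")
print(first_name, last_name)
```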

Btw, some may argue: 'regex are the wrong tool for extracting data from HTML !!elf!1!'

Well, that is up to the poster. He is asking for a regular expression, and we don't know the details. Perhaps for his restricted use case everything else is overkill (e.g. a one-time analysis where the input data is guaranteed to always use the posted skeleton).

maxschlepzig
  • 1
    -1 as "we don't know the details" is exactly why we can't encourage any poster to use RegEx to parse HTML. In the absence of other information, the norm is that you shouldn't parse (X)HTML with RegEx. – Richard JP Le Guen Aug 31 '10 at 20:29
  • I am not encouraging the poster. The poster got a few comments about possible disadvantages/pitfalls of using regexes. It is his decision what to do. I posted the answer so that, if he decides in favor of regex, he gets a hint on how to use regex group matching. – maxschlepzig Aug 31 '10 at 21:21
  • 3
    +1 for trying to give the poster what he's asking for. He's not asking for a diatribe on why he shouldn't... he wants to know how. He's not even asking for a full blown parser... he just wants to extract some text. – BCG Aug 31 '10 at 21:51
  • @bgould - I agree that the OP doesn't need a diatribe; linking to [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454) has never been useful for anyone. Instead we have to explain _why_ what the OP is doing is not a good idea; your answer is good (and not voted down) as you emphasize that your solution is limited in scope and not a general solution. @maxschlepzig's answer doesn't concede that point and instead challenges the competence of those who suggest otherwise. – Richard JP Le Guen Sep 01 '10 at 14:25
  • 1
    Using an HTML/XML parser is hardly overkill. It is only a few more lines of code. The hardest part is conceptual. It means the OP gets a shiny new tool for his belt. – Byron Whitlock Sep 01 '10 at 18:47