
I want to extract an author's name from an HTML tag. The tag looks like this:

<a href="http://somewhere.com">    Manfred    </a>

but if the name is too long, it looks like this:

<a title="floormanager004" href="http://somewhere.com">    floormanage...    </a>

I have the following regex to cover both cases:

~<a.*(title="(.{2,50})".*|>(.*))</a>~Usi

This works fine in the second case, returning a two dimensional array like this:

array(2) {
  [0]=>
  string "title="floormanager004" href="http://somewhere.com">    floormanage...    "
  [1]=>
  string "floormanager004"
}

But for the first case, the array contains an additional empty field:

array(3) {
  [0]=>
  string ">    Manfred    "
  [1]=>
  string ""
  [2]=>
  string "    Manfred    "
}

Why does this field appear, and how can I get rid of it?

Disclaimer: I know that when you use regex to parse HTML you're gonna have a baaaaad time and you should never ever ever do this, but in my case it has proven to be faster than XPath and the like. Please don't comment on this.

Thomas
  • The first tag doesn't have the `title` attribute that your regex looks for, so when it's missing you get an empty entry in the result. – Nadh Apr 27 '12 at 11:39

2 Answers


Every set of parentheses is going to have an associated value in the returned array every time there's a successful match on the whole regex, even if what the parenthesized bit matches is nothing. When some of the captures might be empty, your code needs to detect and handle that case.
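As a minimal sketch of what "detect and handle that case" could look like in PHP, the snippet below keeps the question's regex unchanged and picks whichever capture is non-empty. The helper name `extractAuthor` is hypothetical and not part of the original code; the `??` fallback is there because `preg_match` reports an unmatched group as an empty string or, if it is the last group, leaves it out of the array.

```php
<?php
// Hypothetical helper: returns the author name from a single <a> tag,
// or null if the tag does not match the pattern at all.
function extractAuthor(string $html): ?string
{
    // The regex from the question, unchanged.
    $pattern = '~<a.*(title="(.{2,50})".*|>(.*))</a>~Usi';

    if (!preg_match($pattern, $html, $m)) {
        return null;
    }

    // Group 2 is the title attribute, group 3 is the link text.
    // An unmatched group is either "" or missing from $m, so cover both.
    $title = $m[2] ?? '';
    $text  = $m[3] ?? '';

    // Prefer the full name from the title; otherwise use the trimmed text.
    return $title !== '' ? $title : trim($text);
}

var_dump(extractAuthor('<a href="http://somewhere.com">    Manfred    </a>'));
var_dump(extractAuthor('<a title="floormanager004" href="http://somewhere.com">    floormanage...    </a>'));
```

The empty entry stays in the match array, but the calling code no longer has to care which alternative matched.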

Mark Reed
  • Thank you. So there is no way to do it in the regex? The first path `title="(.{2,50})".*` shouldn't match at all for _Manfred_, so even then an array entry is created? – Thomas Apr 27 '12 at 12:23
  • Oops, forgot to format my answer properly. Now you should see what I did. – Dario Pedol Apr 27 '12 at 12:46
  • You can't build a regular expression that will return only the positions that matched something. If it only returned one thing, how would you know which set of parentheses it came from? Besides, matching "nothing" is still a successful match, so it returns the nothing that it matched. – Mark Reed Apr 27 '12 at 21:06

The title attribute is missing for good ol' Manfred.

This works for those cases:

~<a.*>(.*)</a>~Usi
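
A quick way to check this pattern is to run it against the two tags from the question. The sketch below is illustrative only; the variable names are assumptions. Note that for the second tag the single capture holds the shortened link text, not the value of the `title` attribute.

```php
<?php
// The simplified pattern: one capture group for the link text only.
$pattern = '~<a.*>(.*)</a>~Usi';

// The two sample tags from the question.
$tags = [
    '<a href="http://somewhere.com">    Manfred    </a>',
    '<a title="floormanager004" href="http://somewhere.com">    floormanage...    </a>',
];

$names = [];
foreach ($tags as $tag) {
    if (preg_match($pattern, $tag, $m)) {
        // $m[1] is the text between the opening and closing tag.
        $names[] = trim($m[1]);
    }
}

var_dump($names);
```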

I just can't be quiet about this: see the second most voted question on Stack Overflow. I suggest you read the whole thing:

RegEx match open tags except XHTML self-contained tags

Dario Pedol
  • I suggest the same to you, especially the second answer ;) Please stop answering with this out of reflex. There are a lot of cases where you don't want all of the document's elements to be parsed, but only tiny fractions of them. In this case XML parsing is a lot of overhead. – Thomas Apr 27 '12 at 12:26