I'm trying to parse the following code:

<td class='postac'>Actelsar </td>
<td class='postac'>tabl. 80 mg 28 tabl.</td>

The result should be the text between the "<td class='postac'>" and "</td>" tags that starts with a capital letter and contains no "<" sign.

Regex: /<td class=\'postac\'>^[A-Z]+([^<]*)$<\/td>/s

The code above doesn't work. Thanks for your help.

Andy Lester
mik.ro
  • Use [`DOMDocument`](http://php.net/manual/en/class.domdocument.php) and [`DOMXPath`](http://php.net/manual/en/class.domxpath.php) instead – Havelock Jan 08 '13 at 22:33
  • @Havelock: why prefer XPath over regular expressions *in this particular case*? – zerkms Jan 08 '13 at 22:33
  • `^` means start of subject, and `$` end of subject. Which won't ever work if there is some text in front and something behind it. -- See also [Open source RegexBuddy alternatives](http://stackoverflow.com/questions/89718/is-there) and [Online regex testing](http://stackoverflow.com/questions/32282/regex-testing) for some helpful tools, or [RegExp.info](http://regular-expressions.info/) for a nicer tutorial. – mario Jan 08 '13 at 22:33
  • Trying to parse HTML with regular expressions is bad. See Havelock's comment and http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454 It's not about "a particular case", it's about the whole idea of even trying it. – KingCrunch Jan 08 '13 at 22:35
  • @KingCrunch: it's not HTML parsing, it's a check of whether an arbitrary string matches an arbitrary format. It's no different from checking that a nickname fits some pattern using a regex – zerkms Jan 08 '13 at 22:35
  • @zerkms because I think the OP is crawling pages and parsing them – Havelock Jan 08 '13 at 22:36
  • @zerkms Well, may be, but must say, that I'm not completely convinced. The OP doesn't tell, whats the goal and because it is obvious a HTML table I have to assume, that he tries to parse HTML. – KingCrunch Jan 08 '13 at 22:37
  • @KingCrunch: well, seems like it's subjective. For me it looks like a matching to the format, nothing more. – zerkms Jan 08 '13 at 22:37
  • "The OP doesn't tell, whats the goal and because it is obvious a HTML table I have to assume, that he tries to parse HTML" --- he said "I'm trying to parse following code:" --- which is definitely not a valid HTML, but a piece of it, which "by chance" looks like an HTML :-) – zerkms Jan 08 '13 at 22:38
  • @zerkms The OP might be matching against results of unit tests, then maybe yes, but I would still take the other approach if I didn't feel comfortable with RegExps – Havelock Jan 08 '13 at 22:40
  • You say that it's not HTML parsing, but it *is* HTML parsing. – Andy Lester Jan 08 '13 at 22:42
  • @Havelock Thanks for the link. I'll look at this after learning the regex's basics;) – mik.ro Jan 08 '13 at 22:43
  • @Andy Lester: how this task differs from this: please help me match the string that starts with capital letter and doesn't contain `<` from the string `Foo Bar `? Is there any conceptual difference? (keep in mind I took the exact task definition, but another string) – zerkms Jan 08 '13 at 22:44
  • Using regular expressions on non-regular languages (e.g. HTML, XML, any programming language) is OK for a one-time command line hack. For anything expected to work repeatedly, use a proper parser. – kevin cline Jan 08 '13 at 23:19
  • @kevin cline: is it "allowed" to parse a name from the string using regex `Foo Bar `? And from the string `Foo Bar `? And from the string `Foo Bar `? And from the string `Foo Bar `? And from the string `[baz]Foo Bar [/baz]`? On which step it becomes a big no-no-no? – zerkms Jan 08 '13 at 23:40

3 Answers

The code above doesn't work

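It doesn't work because you've put the `^` and `$` anchors in the middle of the regex. They match the beginning and the end of the string (or of a line, in multiline mode) respectively, so in this pattern they can never match right after the opening tag or right before the closing one.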

This should do what you want:

/<td class=\'postac\'>([A-Z][^<]*)<\/td>/s
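
As a minimal sketch of the difference, assuming the two table cells from the question are held in a string (the variable names `$html`, `$broken`, `$fixed`, and `$names` are hypothetical), both patterns can be run with `preg_match_all`. The `trim` call is an extra step, not part of the pattern above, added to drop the trailing space inside the first cell.

```php
<?php
// Sample input from the question; $html is a hypothetical variable name.
$html = "<td class='postac'>Actelsar </td>\n"
      . "<td class='postac'>tabl. 80 mg 28 tabl.</td>";

// Original pattern: ^ and $ sit mid-pattern, so nothing can match.
$broken = '/<td class=\'postac\'>^[A-Z]+([^<]*)$<\/td>/s';
$brokenCount = preg_match_all($broken, $html, $ignored);

// Corrected pattern: one capital letter, then anything up to the next "<".
$fixed = '/<td class=\'postac\'>([A-Z][^<]*)<\/td>/s';
preg_match_all($fixed, $html, $matches);

// $matches[1] holds the first capture group of every match.
$names = array_map('trim', $matches[1]);

echo $brokenCount, "\n";
print_r($names);
```

Here the original pattern finds no matches, while the corrected one captures only `Actelsar`, because the second cell starts with a lowercase letter.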
zerkms
  • Good old questions about micro-optimization: Wouldn't `[A-Z]` be more efficient than `[A-Z]+`? The parser can stop looking for capital letters after the first one and just go on with the others. Well, while writing I realize that the worst case is a single wrong test... Nothing said :) – KingCrunch Jan 08 '13 at 22:39

Use an HTML parser, not a regular expression, to parse HTML. It can easily be done with DOMDocument and DOMXPath.

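// $str holds the HTML to search.
$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);
// Select every <td> whose class attribute is exactly "postac".
$nodes = $xpath->query('//td[@class="postac"]');
$result = array();
foreach ($nodes as $node) {
    $text = $node->textContent;
    // Keep only non-empty cells whose first character is an uppercase letter.
    if (isset($text[0]) && ctype_upper($text[0])) {
        $result[] = $text;
    }
}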

Shiplu Mokaddim
  • That's great until he gets `` or `` or `` or `` or `` followed by a line feed before the name starts, or.... – Andy Lester Jan 08 '13 at 22:44
  • Yes, I know that. I was actually changing my code. See the update. It's changed. – Shiplu Mokaddim Jan 08 '13 at 22:50
  • Your code is 5 times longer than the one with regex and it doesn't even check for the requirement from the question (about capital letter) – zerkms Jan 08 '13 at 22:51
  • @zerkms But it handles everything very well and it takes short time to write it. – Shiplu Mokaddim Jan 08 '13 at 22:53
  • @shiplu.mokadd.im: what it handles actually? "and that starts with capital letter." - m? The fanatical following some "dogmas" is not good in programming. – zerkms Jan 08 '13 at 22:54
  • `$text{0}` - curly braces are not recommended to use nowadays. What if there is an empty string? This code will throw a notice? – zerkms Jan 08 '13 at 22:56
  • @zerkms wasting hours to find a regex when you don't know one is not productive either. – Shiplu Mokaddim Jan 08 '13 at 22:57
  • @shiplu.mokadd.im: wasting time to write 8 lines of code that are notice-prone instead of a single line isn't more productive – zerkms Jan 08 '13 at 22:57
  • @kevin cline: what is the criteria for the answer to be called "right"? – zerkms Jan 08 '13 at 23:41

/<td class=\'postac\'>([A-Z]+.*)<\/td>/ will match "Actelsar ", but not "tabl. 80 mg 28 tabl."

jakeonrails
  • Might be worth noting: It will also match `Actelsar tabl. 80 mg 28 tabl.` from `Actelsar tabl. 80 mg 28 tabl.` – femtoRgon Jan 08 '13 at 22:39
  • Righto, was testing on Rubular with multiple lines in the input. `([A-Z]+[^>]*)<\/td>` is a better solution that handles multiple elements on one line. – jakeonrails Jan 08 '13 at 22:41
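
The greedy-match problem raised in these comments can be reproduced with a short sketch, assuming both cells sit on one line (the variable names `$line`, `$greedy`, and `$bounded` are hypothetical). The second pattern uses the `[^<]*` form from the accepted answer rather than the `[^>]*` variant in the comment above, and keeps the opening tag in the pattern.

```php
<?php
// Both cells on a single line; $line is a hypothetical variable name.
$line = "<td class='postac'>Actelsar </td><td class='postac'>tabl. 80 mg 28 tabl.</td>";

// Greedy .* runs on to the last </td> on the line.
preg_match('/<td class=\'postac\'>([A-Z]+.*)<\/td>/', $line, $greedy);

// [^<]* cannot cross a "<", so the capture stops at the first closing tag.
preg_match('/<td class=\'postac\'>([A-Z]+[^<]*)<\/td>/', $line, $bounded);

echo $greedy[1], "\n";
echo $bounded[1], "\n";
```

The first capture swallows the closing tag, the second opening tag, and the second cell's text, while the second capture is just `Actelsar ` with its trailing space.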