Regex to find a string that starts with a capital letter and contains no "<" sign

Question

I'm trying to parse the following code:

<td class='postac'>Actelsar </td>
<td class='postac'>tabl. 80 mg 28 tabl.</td>

The match should be the text (containing no "<" sign) that sits between the "<td class='postac'>" and "</td>" tags and starts with a capital letter.

Regex: /<td class=\'postac\'>^[A-Z]+([^<]*)$<\/td>/s

The code above doesn't work. Thanks for your help.

Use [`DOMDocument`](http://php.net/manual/en/class.domdocument.php) and [`DOMXPath`](http://php.net/manual/en/class.domxpath.php) instead — Havelock, Jan 08 '13 at 22:33
@Havelock: why prefer XPath over regular expressions *in this particular case*? — zerkms, Jan 08 '13 at 22:33
`^` means start of subject, and `$` end of subject. Which won't ever work if there is some text in front and something behind it. -- See also [Open source RegexBuddy alternatives](http://stackoverflow.com/questions/89718/is-there) and [Online regex testing](http://stackoverflow.com/questions/32282/regex-testing) for some helpful tools, or [RegExp.info](http://regular-expressions.info/) for a nicer tutorial. — mario, Jan 08 '13 at 22:33
Trying to parse HTML with regular expressions is a bad idea. See Havelock's comment and http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454 It's not about "a particular case", it's about the whole idea of even trying it. — KingCrunch, Jan 08 '13 at 22:35
@KingCrunch: it's not HTML parsing, it's a check of whether an arbitrary string matches an arbitrary format. It's no different from checking that a nickname fits some pattern using a regex — zerkms, Jan 08 '13 at 22:35
@zerkms because I think the OP is crawling pages and parsing them — Havelock, Jan 08 '13 at 22:36
@zerkms Well, may be, but must say, that I'm not completely convinced. The OP doesn't tell, whats the goal and because it is obvious a HTML table I have to assume, that he tries to parse HTML. — KingCrunch, Jan 08 '13 at 22:37
@KingCrunch: well, seems like it's subjective. For me it looks like a matching to the format, nothing more. — zerkms, Jan 08 '13 at 22:37
"The OP doesn't tell, whats the goal and because it is obvious a HTML table I have to assume, that he tries to parse HTML" --- he said "I'm trying to parse following code:" --- which is definitely not a valid HTML, but a piece of it, which "by chance" looks like an HTML :-) — zerkms, Jan 08 '13 at 22:38
@zerkms The OP might be matching against results of unit tests, then maybe yes, but still would take the other approach if I wouldn't feel comfortable with RegExps — Havelock, Jan 08 '13 at 22:40
You say that it's not HTML parsing, but it *is* HTML parsing. — Andy Lester, Jan 08 '13 at 22:42
@Havelock Thanks for the link. I'll look at this after learning the regex's basics;) — mik.ro, Jan 08 '13 at 22:43
@Andy Lester: how this task differs from this: please help me match the string that starts with capital letter and doesn't contain `<` from the string `Foo Bar `? Is there any conceptual difference? (keep in mind I took the exact task definition, but another string) — zerkms, Jan 08 '13 at 22:44
Using regular expressions on non-regular languages (e.g. HTML, XML, any programming language) is OK for a one-time command line hack. For anything expected to work repeatedly, use a proper parser. — kevin cline, Jan 08 '13 at 23:19
@kevin cline: is it "allowed" to parse a name from the string using regex `Foo Bar `? And from the string `Foo Bar `? And from the string `Foo Bar `? And from the string `Foo Bar `? And from the string `[baz]Foo Bar [/baz]`? On which step it becomes a big no-no-no? — zerkms, Jan 08 '13 at 23:40

zerkms · Accepted Answer · 2013-01-08T22:40:01.137

4

"The code above doesn't work"

It doesn't work because you've put `^` and `$` in the middle of the regex. They anchor the beginning and the end of the string (or line), respectively, so they can never match between the opening tag and the text that follows it.

This should do what you want:

/<td class=\'postac\'>([A-Z][^<]*)<\/td>/s
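As a quick check of the pattern above, here is a minimal sketch that runs it over the two cells from the question. The variable names (`$html`, `$pattern`, `$matches`) are illustrative only and do not come from the original post.

```php
<?php
// Sample input taken from the question.
$html = "<td class='postac'>Actelsar </td>\n"
      . "<td class='postac'>tabl. 80 mg 28 tabl.</td>";

// One capital letter, then any run of characters that are not '<'.
$pattern = '/<td class=\'postac\'>([A-Z][^<]*)<\/td>/s';

// $matches[1] collects the text captured by the parenthesised group.
preg_match_all($pattern, $html, $matches);

print_r($matches[1]);
```

Only the first cell is captured, as "Actelsar " with its trailing space. The second cell starts with a lowercase "t", so it is skipped.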

edited Jan 08 '13 at 22:40

answered Jan 08 '13 at 22:32

zerkms


Good old question about micro-optimization: wouldn't `[A-Z]` be more efficient than `[A-Z]+`? The parser can stop looking for capital letters after the first one and just go on with the rest. Well, while writing this I realize that the worst case is a single failed test... Never mind :) – KingCrunch Jan 08 '13 at 22:39

Shiplu Mokaddim · Answer 2 · 2013-01-08T23:57:27.177

2

Use an HTML parser to parse HTML, not a regular expression. This is easily done with DOMDocument and DOMXPath.

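// $str holds the HTML to search.
$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

// Select every <td> whose class attribute is exactly "postac".
$nodes = $xpath->query('//td[@class="postac"]');

// Keep only the cells whose text starts with an ASCII capital letter.
$result = array();
foreach ($nodes as $node) {
    $text = $node->textContent;
    if (isset($text[0]) && ctype_upper($text[0])) {
        $result[] = $text;
    }
}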
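The comments on this answer raise the case of whitespace, such as a line feed, before the name starts. A possible variant that tolerates it is sketched below. The sample input, the `<table>` wrapper and the `trim()` call are assumptions added for illustration and are not part of the original answer.

```php
<?php
// Hypothetical input: the first cell begins with a line feed and a space.
$str = "<table><tr><td class='postac'>\n Actelsar </td>"
     . "<td class='postac'>tabl. 80 mg 28 tabl.</td></tr></table>";

$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

$result = [];
foreach ($xpath->query('//td[@class="postac"]') as $node) {
    // Strip surrounding whitespace before testing the first character.
    $text = trim($node->textContent);
    if ($text !== '' && ctype_upper($text[0])) {
        $result[] = $text;
    }
}

print_r($result);
```

Trimming changes the stored value as well, so the result holds "Actelsar" without the trailing space. Whether that is wanted depends on the use case.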


edited Jan 08 '13 at 23:57

answered Jan 08 '13 at 22:33

Shiplu Mokaddim


That's great until he gets `` or `` or `` or `` or `` followed by a line feed before the name starts, or.... – Andy Lester Jan 08 '13 at 22:44
Yes, I know that. I was actually changing my code. See the update. It's changed. – Shiplu Mokaddim Jan 08 '13 at 22:50
Your code is 5 times longer than the one with regex and it doesn't even check for the requirement from the question (about capital letter) – zerkms Jan 08 '13 at 22:51
@zerkms But it handles everything very well and it takes short time to write it. – Shiplu Mokaddim Jan 08 '13 at 22:53
@shiplu.mokadd.im: what it handles actually? "and that starts with capital letter." - m? The fanatical following some "dogmas" is not good in programming. – zerkms Jan 08 '13 at 22:54
`$text{0}` - curly braces are not recommended to use nowadays. What if there is an empty string? This code will throw a notice? – zerkms Jan 08 '13 at 22:56
@zerkms wasting hours to find a regex when you don't know one is not productive either. – Shiplu Mokaddim Jan 08 '13 at 22:57
@shiplu.mokadd.im: wasting time to write 8 lines of code that are notice-prone instead of a single line isn't more productive – zerkms Jan 08 '13 at 22:57
@kevin cline: what is the criteria for the answer to be called "right"? – zerkms Jan 08 '13 at 23:41

jakeonrails · Answer 3 · answered Jan 08 '13 at 22:36

0

/<td class=\'postac\'>([A-Z]+.*)<\/td>/ will match "Actelsar " but not "tabl. 80 mg 28 tabl."
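This pattern behaves as described when each cell is on its own line, because `.` does not match a newline without the `s` modifier. A small sketch of what happens when both cells share one line, compared with a `[^<]*` version, follows. The single-line input and the variable names are assumptions made for illustration.

```php
<?php
// Hypothetical variant of the question's input: both cells on one line.
$line = "<td class='postac'>Actelsar </td><td class='postac'>tabl. 80 mg 28 tabl.</td>";

// Greedy '.*' runs on to the last '</td>' on the line.
preg_match('/<td class=\'postac\'>([A-Z]+.*)<\/td>/', $line, $greedy);

// '[^<]*' cannot cross a '<', so it stops at the first closing tag.
preg_match('/<td class=\'postac\'>([A-Z][^<]*)<\/td>/', $line, $bounded);

var_dump($greedy[1]);
var_dump($bounded[1]);
```

The greedy capture swallows the closing tag and the whole second cell, while the negated character class yields only "Actelsar ".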

answered Jan 08 '13 at 22:36

jakeonrails


Might be worth noting: It will also match `Actelsar </td><td class='postac'>tabl. 80 mg 28 tabl.` from `<td class='postac'>Actelsar </td><td class='postac'>tabl. 80 mg 28 tabl.</td>` – femtoRgon Jan 08 '13 at 22:39
Righto, was testing on Rubular with multiple lines in the input. `([A-Z]+[^>]*)<\/td>` is a better solution that handles multiple elements on one line. – jakeonrails Jan 08 '13 at 22:41
