I'm using preg_match_all to match the text inside a <td> that is also wrapped in <strong> tags. My problem is that the HTML contains newlines. This is the HTML:

<td 
class="vcenter text-center">
<strong>Match This </strong></td>

For now I'm using this pattern to get the text:

!<td\nclass="vcenter text-center">\n<strong>(.*?)<\/strong><\/td>!

This does get the text, but it won't work if that newline (inside the td tag) disappears from the HTML. What can I do in this situation?

P.S.: I'm using curl to get that HTML (and I don't want to add an extra class like simple_html_dom :-s).

Thank you!

emma
  • You don't need to add a new class; `DOMDocument` is a native class in PHP. – Mohammad Sep 12 '18 at 11:38
  • Agreed that DOMDocument is the way to do this. – Alex K. Sep 12 '18 at 11:41
  • I'll echo the others who say that it's a bad idea to parse HTML using regex. It is possible to get it to work, but it will be very *very* brittle; small differences in the HTML are likely to break your parser. E.g. what happens to your regex if they stop using the `` tag? That's fine if you have control over the HTML input and know that it won't change, but you probably don't have that. So if at all possible you should be handling this data outside of the HTML context; e.g. if an API exists, then use that rather than parsing HTML. But if you *must* parse HTML, use DOMDocument, not regex. – Spudley Sep 12 '18 at 11:55
  • I am curious about your delimiters (`!`). You probably know that the normal delimiter is the slash (`/`). The usual reason to change the delimiter is to free the slash for simple matching in strings with many slashes inside them, such as HTML strings. You don’t appear to have taken advantage of that … ? – Manngo Sep 12 '18 at 21:17
  • Also, as per my comment on the accepted answer, I don’t think it’s always correct that you shouldn’t use regex for HTML strings. Your target string is simple and straightforward, and it is easily and reliably matched with a regular expression. – Manngo Sep 12 '18 at 21:20
  • With respect to @Toto, I don’t think this is a duplicate. It boils down to a question about regular expressions and how to handle the optional line break. It is not a question about _parsing_ HTML in general. Seriously, it’s _not_ about parsing, and the linked “duplicate” does _not_ answer this question at all. – Manngo Sep 13 '18 at 01:37

1 Answer

You should not be using a regex to parse HTML; you should use an XML parser instead.
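As a rough illustration of the parser route, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath (the class the comments on the question suggest). The helper name `extractStrongText` and the surrounding `<table><tr>` wrapper in the sample are assumptions made for the example; only the td markup comes from the question.

```php
<?php
// Hypothetical helper: returns the text of every <strong> that is a direct
// child of a <td class="vcenter text-center">, regardless of whitespace
// or newlines inside the tags.
function extractStrongText(string $html): array
{
    // Collect libxml warnings internally instead of emitting PHP warnings
    // for imperfect real-world markup.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Exact attribute match, mirroring the literal class string in the regex.
    $nodes = $xpath->query('//td[@class="vcenter text-center"]/strong');

    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Sample input: the td from the question, wrapped in an assumed table row.
$html = "<table><tr><td \nclass=\"vcenter text-center\">\n"
      . "<strong>Match This </strong></td></tr></table>";
print_r(extractStrongText($html));
```

Because the parser normalises the markup before the query runs, it makes no difference whether the newline inside the td tag is present.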

But as far as the newline is concerned: you want one or more whitespace characters, not specifically a newline.

You could replace the \n inside the td tag with \s+ to achieve that:

!<td\s+class="vcenter text-center">\n<strong>(.*?)<\/strong><\/td>!
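For completeness, a small sketch of how that pattern might be used with preg_match_all. It goes slightly beyond the pattern above, and these parts are assumptions rather than part of the answer: the second \n is relaxed to \s* so that the newline before <strong> is optional too, the `s` modifier lets `.` span line breaks, and the helper name `matchStrongCells` is made up for the example. Since `!` is the delimiter, the slashes in the closing tags are left unescaped.

```php
<?php
// Hypothetical helper: returns the captured text of each matching cell.
function matchStrongCells(string $html): array
{
    // \s+ : one or more whitespace characters between "td" and "class"
    // \s* : optional whitespace between ">" and "<strong>" (assumption)
    // s   : let "." match newlines inside the captured text (assumption)
    $pattern = '!<td\s+class="vcenter text-center">\s*<strong>(.*?)</strong></td>!s';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

// Works with the newlines from the question...
$withNewlines = "<td \nclass=\"vcenter text-center\">\n<strong>Match This </strong></td>";
// ...and when they disappear.
$withoutNewlines = '<td class="vcenter text-center"><strong>Match This </strong></td>';

print_r(matchStrongCells($withNewlines));
print_r(matchStrongCells($withoutNewlines));
```

Note that the captured text keeps its trailing space; wrap the result in trim() if that matters.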
jeroen
  • Hey @jeroen, thank you for this answer, it works :D But why shouldn't I use regex to parse HTML? :-s (I mean, I was told this before, but nobody could give me a 'why'.) – emma Sep 12 '18 at 11:54
  • @emma See the Regular Expressions section (at the end) of https://stackoverflow.com/a/3577662/42139 for example. – jeroen Sep 12 '18 at 11:56
  • @emma I’m going to give an opposing opinion on using regex. Agreed, parsing a whole HTML file is well beyond the scope of regex which, regardless of how complex the expression is, is still just matching patterns. However, if all you are doing is looking for one well-defined pattern, regex may well be a simple and reliable solution. Your particular case is one which is probably well served with regex. – Manngo Sep 12 '18 at 21:13