I want to match the first few paragraphs in some HTML code, and I have this so far:

$page_contents = '<html><body><p style="asdas" class="asd asdas">lorem ipsum 1</p><p>lorem ipsum 2</p><br></body></html>';

preg_match_all('/\<p(?:[^>]*)?>(.*)<\/p>/is', $page_contents, $paragraphs_matches);

print_r($paragraphs_matches);

It matches this:

lorem ipsum 1</p><p>lorem ipsum 2

How can I modify it so that it matches this instead?

lorem ipsum 1
lorem ipsum 2
adrianTNT
  • If you already know that you need to use the DOM (since you used this tag), why do you ask a question about the way to solve your problem with regex? – Casimir et Hippolyte Apr 18 '17 at 19:46
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – epascarello Apr 18 '17 at 19:47
  • @Casimir because I keep hearing about DOM when I look for an answer :) but I tried some DOM functions that didn't work. – adrianTNT Apr 18 '17 at 19:47
  • If you want to do it with a regex, you need a non-greedy match. Regex quantifiers are greedy by default: they try to match as many characters as possible. That is why your regex matches from the first `<p>` to the last `</p>` found. To make a quantifier non-greedy, add `?`: `.*` is greedy, `.*?` is non-greedy (it matches as few characters as possible). Docs: http://docstore.mik.ua/orelly/webprog/pcook/ch13_05.htm – Viliam Aboši Apr 18 '17 at 19:49
  • Oh, you *"tried some DOM functions that didn't work"*! In this case, post your tries that didn't work. – Casimir et Hippolyte Apr 18 '17 at 19:50
  • @Viliam `'/\<p(?:[^>]*)?>(.*?)<\/p>/i'` thank you. – adrianTNT Apr 18 '17 at 19:51
  • @Casimir I tried a custom class (`simple_html_dom.php`) that has many functions and seems popular, but it seemed to do some unnecessary loops, and I could not get it to print what I need (a relatively simple thing). – adrianTNT Apr 18 '17 at 19:53
  • https://gist.github.com/4698d08b6473853141b6f4f7d4daa924 – Pedro Lobito Apr 18 '17 at 19:55
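
For reference, here is a minimal sketch of the two approaches discussed in the comments, run against the `$page_contents` string from the question. The first part uses the non-greedy `(.*?)` quantifier; the `\b` after `<p` is an addition not in the original pattern, included so that tags such as `<pre>` are not matched by accident. The second part is one possible DOM-based version using PHP's bundled `DOMDocument` class, which the thread does not spell out, so treat it as an assumption about what "use the DOM" could look like. The variable names `$regex_paragraphs` and `$dom_paragraphs` are illustrative only.

```php
<?php
$page_contents = '<html><body><p style="asdas" class="asd asdas">lorem ipsum 1</p><p>lorem ipsum 2</p><br></body></html>';

// Regex approach: (.*?) is non-greedy, so each match ends at the nearest </p>
// instead of running on to the last one in the document.
// \b (an addition to the original pattern) stops <p from matching <pre>, <param>, etc.
preg_match_all('/<p\b[^>]*>(.*?)<\/p>/is', $page_contents, $paragraphs_matches);
$regex_paragraphs = $paragraphs_matches[1]; // capture group 1 holds the inner content

// DOM approach (assumes the standard dom extension is enabled).
$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($page_contents);
libxml_clear_errors();

$dom_paragraphs = [];
foreach ($dom->getElementsByTagName('p') as $node) {
    $dom_paragraphs[] = $node->textContent; // text only, nested tags stripped
}

print_r($regex_paragraphs);
print_r($dom_paragraphs);
```

To keep only the first few paragraphs, something like `array_slice($regex_paragraphs, 0, 2)` should work on either array. Note that the regex version returns the inner HTML of each paragraph, while `textContent` returns plain text, so the two results differ once paragraphs contain nested tags.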
