
I have something like this on my site:

 <div class="latestItemIntroText">

        <div class="itemLinks">
            <div class="share">Share</div>
            <div class="dummy-div"></div>

            <div class="addthis_sharing_toolbox"></div>

        </div>
     Lorem ipsum <br /><br />
     Lorem ipsum <br /><br />
     Lorem ipsum <br /><br />
     Lorem ipsum <br /><br />

 </div>

I need to extract only the Lorem ipsum text. I tried a regex like this:

</div>([\s?]+[^<]+[<br?/?>]*[^<]+[<br?/?>]*[^<]+[<br?/?>]*[^<]+)</div>

I noticed that I repeat this part many times:

[^<]+[<br?/?>]* --> because I don't know how many times a br will follow the lorem ipsum text; it may be once, it may be 10 times. Is there a way to shorten this regex?

LukStorms
wenus
  • what are you doing the regex with? JS? php? – treyBake May 24 '17 at 14:36
  • Oh sorry I forgot add this info - PHP – wenus May 24 '17 at 14:37
  • That feeling when people use regex for things they are supposed to use the DOM for. Clearly an XY problem – GottZ May 24 '17 at 14:39
  • Please format your code. Try adding four spaces at the beginning of your line that has the regex or wrapping in backticks (`). As it stands, I can't tell if the space in your closing div tag is supposed to be there or not. Also, don't parse html with regex. [It's generally a bad idea.](https://stackoverflow.com/a/1732454/854246) – Joseph Marikle May 24 '17 at 14:40
  • with an `innerHTML` or something like that – alessandrio May 24 '17 at 14:41
  • Your regex is all kinds of wrong anyway. – Niet the Dark Absol May 24 '17 at 14:49
  • Someone may well post a solution that doesn't use regex. But the way you try to capture the tags in the regex is wrong. `[]` is a character class: `[yum]` matches a single character that is y, u, or m, not the word `yum`. For the word you need a capture group `(yum)`, a non-capture group `(?:yum)`, or just `yum` – LukStorms May 24 '17 at 14:52
  • For example: `<\/div>(?=\s*\w)(.*?)<\/div>` – LukStorms May 24 '17 at 15:52

2 Answers


Parsing an HTML string with regex is not a good approach; use DOMDocument instead.

Try this code snippet:

<?php
ini_set('display_errors', 1);
$string = <<<HTML
<div class="latestItemIntroText">

        <div class="itemLinks">
            <div class="share">Share</div>
            <div class="dummy-div"></div>

            <div class="addthis_sharing_toolbox"></div>

        </div>
     Lorem ipsum <br /><br />
     Lorem ipsum <br /><br />
     Lorem ipsum <br /><br />
     Lorem ipsum <br /><br />

 </div>
HTML;

$domDocument = new DOMDocument();
$domDocument->loadHTML($string);

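$domXPath = new DOMXPath($domDocument);

// Collect the itemLinks blocks first, then remove them,
// so that removing nodes does not disturb the iteration.
$toRemove = iterator_to_array($domXPath->query('//div[@class="itemLinks"]'));
foreach ($toRemove as $removal) {
    $removal->parentNode->removeChild($removal);
}

// Only the intro text is left in the container; trim the surrounding whitespace.
$results = $domXPath->query('//div[@class="latestItemIntroText"]');
print_r(trim($results->item(0)->textContent));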
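
As a possible alternative to removing nodes, the XPath `text()` test can select only the text nodes that are direct children of the container, which leaves the document untouched. This is a minimal sketch, not part of the snippet above: the shortened sample markup and the variable names (`$html`, `$lines`) are assumptions made for illustration.

```php
<?php
// Hypothetical sample markup, shortened from the question.
$html = <<<HTML
<div class="latestItemIntroText">
    <div class="itemLinks">
        <div class="share">Share</div>
    </div>
    Lorem ipsum <br /><br />
    Lorem ipsum <br /><br />
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// text() matches only text nodes directly under the container,
// so the text nested inside the itemLinks div is skipped.
$lines = [];
foreach ($xpath->query('//div[@class="latestItemIntroText"]/text()') as $textNode) {
    $text = trim($textNode->nodeValue);
    if ($text !== '') {
        // Keep only nodes that hold more than whitespace.
        $lines[] = $text;
    }
}

print_r($lines);
```

Each entry of `$lines` then holds one run of text between the br tags, which may be easier to post-process than a single `textContent` string.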
Sahil Gulati

This simpler regex does work for your input. All the usual caveats about the million different ways this would break do apply.

^(?!.*(?:<div|</div>))(.+?)(?=<br\s?/>|$)
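  1. A negative lookahead rejects any line that contains a div tag.
  2. The text is then captured lazily up to a <br/> or the end of the line, via a positive lookahead. This needs the multiline flag (m) so that ^ and $ match at every line.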

https://regex101.com/r/ePaFrp/4/
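
Since the question is about PHP, here is a rough sketch of how this pattern might be applied with `preg_match_all`. The shortened sample markup, the `~` delimiter, and the variable names are assumptions for illustration; the `m` flag is what lets `^` and `$` anchor at each line.

```php
<?php
// Hypothetical sample markup, shortened from the question.
$html = <<<HTML
<div class="latestItemIntroText">
    <div class="itemLinks">
        <div class="share">Share</div>
    </div>
    Lorem ipsum <br /><br />
    Lorem ipsum <br /><br />
</div>
HTML;

// "~" is used as the delimiter so the "/" characters need no escaping.
// The "m" flag makes ^ and $ match at the start and end of every line.
$pattern = '~^(?!.*(?:<div|</div>))(.+?)(?=<br\s?/>|$)~m';

preg_match_all($pattern, $html, $matches);

// Group 1 still carries the indentation and trailing space, so trim each match.
$lines = array_map('trim', $matches[1]);

print_r($lines);
```

The same caveats apply here: a div tag on the same line as the text, or text spread over several lines without a br, would not be handled by this pattern.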

Scott Weaver