PHP How to remove certain attributes from a body of text

Question

I have a variable, $text, which outputs a load of HTML. Most of it is not useful for my purposes, but some of it is.

HTML that comes out:

<div class="feed-item-description">
<ul>
<li><strong>Impact:</strong>&nbsp;Low</li>
<li><strong>Severity:</strong> <span class="label label-info">Low</span></li>
</ul>
...

What I'd like to do

I'd like to get the impact and the severity rating out of this text. I don't need the label.

I have tried doing this:

$itemAttributes = explode (':' , $text);

$impact     = $itemAttributes[3];
$severity   = $itemAttributes[4];

This does seem to give me the attributes I want, but each piece also seems to include the word that follows it. It also behaves strangely: even if I trim it, I cannot get rid of the leading space in my output.

It also seems to close a <div> behind it, which I can't explain. I'm sure I'm about to get shouted down about using Regex for HTML, but I figured there must be a way to get something so simple out as it's the same words each time preceding the information I want.

If you want to see the actual output on a page, you can see it here: https://dev.joomlalondon.co.uk/. In the output I generate, it closes the <div class="feed-item-description">, but I don't tell it to do that anywhere, and the output I use is contained within an <li>, not a <div>.
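A small sketch of what the `explode` attempt produces, assuming `$text` holds the sample markup shown above (the real feed has more fields, so the indexes differ from the 3 and 4 used in the question). Each piece runs up to the next colon, so it still carries closing tags, the `&nbsp;` entity and the next label. `trim()` does not remove `&nbsp;`, and the stray closing tags are a likely reason a browser ends up closing elements early.

```php
<?php
// Sample markup from the question, assumed to be what $text contains.
$text = '<div class="feed-item-description">
<ul>
<li><strong>Impact:</strong>&nbsp;Low</li>
<li><strong>Severity:</strong> <span class="label label-info">Low</span></li>
</ul>';

$parts = explode(':', $text);

// $parts[1] runs from after "Impact:" up to the next colon, so it still holds
// closing tags, the &nbsp; entity and the start of the next <li>.
$piece = $parts[1];

// Removing the tags shows what is left: the value plus the next label.
$stripped = strip_tags($piece);

// trim() strips ASCII whitespace only, so the literal "&nbsp;" stays in front.
$trimmed = trim($stripped);
```

Dumping `$piece` with `var_dump()` rather than echoing it into a page makes the leftover markup visible.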

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

Maybe,

^\h*(Impact:)\s+(.*)|^\h+(Severity:)\s+(.*)

would simply return those desired values.

Test

$re = '/^\h*(Impact:)\s+(.*)|^\h+(Severity:)\s+(.*)/m';
$str = 'Project: Joomla!
    SubProject: CMS
    Impact: Low
    Severity: Low
    Versions: 3.6.0 - 3.9.12
    Exploit type: Path Disclosure
    Reported Date: 2019-November-01
    Fixed Date: 2019-November-05
    CVE Number: CVE-2019-18674

Description

Missing access check in the phputf8 mapping files could lead to an path disclosure.
Affected Installs

Joomla! CMS versions 3.6.0 - 3.9.12';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

Output

array(2) {
  [0]=>
  array(3) {
    [0]=>
    string(15) "    Impact: Low"
    [1]=>
    string(7) "Impact:"
    [2]=>
    string(3) "Low"
  }
  [1]=>
  array(5) {
    [0]=>
    string(17) "    Severity: Low"
    [1]=>
    string(0) ""
    [2]=>
    string(0) ""
    [3]=>
    string(9) "Severity:"
    [4]=>
    string(3) "Low"
  }
}

If you wish to simplify, update, or explore the expression, it is explained in the top-right panel of regex101.com. You can watch the matching steps or modify them in this debugger link, if you're interested. The debugger demonstrates how a regex engine might consume sample input strings step by step and perform the matching process.
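The test string above is the rendered text rather than the markup. Below is a minimal sketch of running a similar pattern against the HTML from the question, assuming it is acceptable to strip the tags and decode entities first. The `extractField` helper is a hypothetical name and is not part of the answer.

```php
<?php
/**
 * Hypothetical helper: returns the value that follows "$label:" on its line
 * in $html, or null when the label is not found.
 */
function extractField(string $html, string $label): ?string
{
    // Reduce the markup to plain text; &nbsp; becomes U+00A0.
    $plain = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Match "Label:" at the start of a line, skip horizontal whitespace
    // (including U+00A0), and capture the rest of the line.
    $pattern = '/^\h*' . preg_quote($label, '/') . ':[\h\x{A0}]*(.+?)[\h\x{A0}]*$/mu';

    return preg_match($pattern, $plain, $m) === 1 ? $m[1] : null;
}

// Sample markup from the question.
$text = '<div class="feed-item-description">
<ul>
<li><strong>Impact:</strong>&nbsp;Low</li>
<li><strong>Severity:</strong> <span class="label label-info">Low</span></li>
</ul>';

$impact   = extractField($text, 'Impact');
$severity = extractField($text, 'Severity');
```

This keeps the regex working on text only, which avoids matching against tags directly.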

RegEx Circuit

jex.im visualizes regular expressions.

Man that's a much better website than the RegEx website I've been using! I got `array(0) { } ` but I can at least start testing better thank you. — Eoin, Dec 07 '19 at 02:07
It has also really helped me to dump the entire source code instead of just the text output as that has meant I can see that my match needs to be far more specific. But I think I can get something out of this now. Thanks! It's really helped me to identify the problem areas, and using groups like you have has opened my eyes a bit too. https://regex101.com/r/pnTeCT/1/ — Eoin, Dec 07 '19 at 02:17

score 0 · Answer 2 · edited Dec 07 '19 at 22:07

Because you should really use DOMDocument to parse HTML, here's a solution using it:

// $html holds the markup to parse (the question's $text).
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$feed_items = $xpath->query('//div[contains(@class, "feed-item-description")]');
foreach ($feed_items as $feed_item) {
    // The leading "." keeps each query relative to the current feed item;
    // a bare "//" would search the whole document every time.
    $impact_node = $xpath->query('.//li[contains(string(), "Impact:")]', $feed_item);
    // Strip the label and any non-word characters after it (including &nbsp;).
    $impact = preg_replace('/Impact:\W*/u', '', $impact_node->item(0)->textContent);
    echo $impact . "\n";
    $severity_node = $xpath->query('.//li[contains(string(), "Severity:")]', $feed_item);
    $severity = preg_replace('/Severity:\W*/u', '', $severity_node->item(0)->textContent);
    echo $severity . "\n";
}

Output (for your sample HTML)

Low
Low

Demo on 3v4l.org
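For reuse, the same idea could be wrapped in a function that guards against missing list items. This is a sketch only: `feedItemFields` is a hypothetical name, `$html` is assumed to hold the markup, and suppressing libxml warnings is an added choice that the answer does not make.

```php
<?php
/**
 * Hypothetical helper: returns one ['impact' => ..., 'severity' => ...] array
 * per feed-item-description div found in $html. Missing fields are null.
 */
function feedItemFields(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//div[contains(@class, "feed-item-description")]') as $item) {
        $row = [];
        foreach (['Impact', 'Severity'] as $label) {
            // ".//" searches below the current div only.
            $nodes = $xpath->query('.//li[contains(string(), "' . $label . ':")]', $item);
            $row[strtolower($label)] = $nodes->length > 0
                ? preg_replace('/' . $label . ':\W*/u', '', $nodes->item(0)->textContent)
                : null;
        }
        $rows[] = $row;
    }
    return $rows;
}
```

Calling `feedItemFields($text)` on the question's sample markup should give a single row with both values set.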

edited Dec 07 '19 at 22:07 by Eoin · answered Dec 07 '19 at 03:19 by Nick

Oddly I get `DOMNodeList Object ( [length] => 0 )` when I `print_r($feed_items);` – Eoin Dec 07 '19 at 15:57
I just ran this on the current version of https://dev.joomlalondon.co.uk/ and I get a list of 8 objects - however the ones in the security section are all empty divs, as the text seems to have been replaced with an image since first looking at this... – Nick Dec 07 '19 at 22:07
The only change I made to the code above was to add `$html = file_get_contents('https://dev.joomlalondon.co.uk/');` Unfortunately I can't add that to the demo as 3v4l.org don't allow url access. – Nick Dec 07 '19 at 23:23
Yes the output has changed, I got the regex working. Is there a benefit to using your method e.g. speed or anything like that? I have the result I want so that's not an issue any more. – Eoin Dec 09 '19 at 13:06
The only change I made was `$html` became `$text` which has the HTML I want to scan in it – Eoin Dec 09 '19 at 13:08
I have updated dev.joomlalondon.co.uk to make it clearer. Underneath *Joomla! Security* you will see a blue card, within that, you will see a white background with bullet points starting: *Project: Joomla!* and that is the HTML that I wanted to capture and output (all contained within `$text`). As I already have it working it's no issue but to improve my knowledge (and perhaps for ease in the future) I'm trying to understand if this is the best route moving forwards. – Eoin Dec 09 '19 at 13:28
@Eoin With your changes the code above comes up with 8 sets of `Low` and `Low`, as expected. For an amusing read about using regex to parse HTML, see [this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). The biggest issue with using regex to parse HTML is when you get nested tags e.g. you're trying to match the outer `<div>` in `<div>...<div>...</div>...</div>` – Nick Dec 09 '19 at 21:08
I read that as part of my research prior to or after posting this question. Hilarious answer, not all that helpful but brilliant all the same. I think I will try your method too just so I understand both ways of doing things. – Eoin Dec 10 '19 at 22:39
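The nested-tag problem raised in the comments can be seen in a few lines. This is an illustrative sketch with made-up markup, not the page from the question: a lazy pattern stops at the first closing tag and cuts the outer element short, while DOMDocument returns the whole element.

```php
<?php
// Made-up markup: an outer div containing an inner div.
$html = '<div id="outer">a<div id="inner">b</div>c</div>';

// The lazy match stops at the first </div>, which belongs to the inner div.
preg_match('~<div id="outer">(.*?)</div>~s', $html, $m);
$regexResult = $m[1]; // 'a<div id="inner">b'

// The DOM parser tracks nesting, so the outer element's text is complete.
$doc = new DOMDocument();
$doc->loadHTML($html);
$outer = (new DOMXPath($doc))->query('//div[@id="outer"]')->item(0);
$domResult = $outer->textContent; // 'abc'
```

Making the pattern greedy does not fix this in general, since it would then run to the last `</div>` in the input.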