Finding a line/string of text in HTML using DOM

Question

I have some plain text/HTML content like so:

Title: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Snippet: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Category: Lorem ipsum dolor sit amet, consectetur adipiscing elit.

and I want to match only the line that says "Snippet:" and the text that follows it, but only on that line and nothing else, with the search being case-insensitive. I tried regular expressions, but ultimately I want to attempt this using DOMDocument. How can I do this?

Possible duplicate of [Ignore html tags in preg_replace](http://stackoverflow.com/questions/8193327/ignore-html-tags-in-preg-replace) - see the `TextRange` class there; the strings it provides are compatible with PCRE and the UTF-8 `u`-modifier. — hakre, May 07 '12 at 15:33

hakre · Answer 1 · 2012-05-07T16:32:01.647

As far as DOM is concerned, see the duplicate I linked in a comment.
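For readers who want a rough idea of what a DOM-based lookup could look like, here is a minimal sketch. It is not the `TextRange` class from the linked question; it simply walks every text node with `DOMXPath` and tests each one with a case-insensitive, line-anchored regular expression. The sample markup in `$html` is an assumption made up for illustration, since the question does not show the actual HTML.

```php
<?php
// Hypothetical sample input; the real markup is not shown in the question.
$html = '<div><p>Title: Lorem ipsum</p><p>snippet: Dolor sit amet</p><p>Category: Elit</p></div>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them for imperfect markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$found = null;

// Visit every text node in document order and stop at the first match.
foreach ($xpath->query('//text()') as $node) {
    // m: ^ and $ match at line boundaries, i: case-insensitive.
    if (preg_match('~^[ \t]*(Snippet:.*)$~mi', $node->nodeValue, $m)) {
        $found = rtrim($m[1]);
        break;
    }
}

echo $found, PHP_EOL; // prints "snippet: Dolor sit amet"
```

Note that a text node only covers text between two tags, so this sketch would miss a "Snippet:" line that is split across inline elements such as `<b>`; handling that case is what the linked duplicate is about.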

Otherwise you might just use a regular expression:

$line = preg_match('~(^Snippet:.*$)~mi', $text, $matches) ? $matches[1] : NULL;

Demo and Regex Explained:

~  -- delimiter
 (  -- start match group 1
  ^  -- start of line
    Snippet:  -- exactly this text
    .*  -- match all but newline
  $  -- end of line
 )  -- end match group 1
~  -- delimiter
m  -- multiline modifier (^ matches begin of line, $ end of line)
i  -- case-insensitive modifier (so "snippet:" and "SNIPPET:" match as well, as the question asks)

score 1 · Accepted Answer · answered May 07 '12 at 15:40

I don't know all the details of your problem, so my answer might not be appropriate. Depending on the size of the content you need to parse, you may decide that this is not an option. Also, the question does not make clear where the HTML content comes in, which is why this solution doesn't use DOM parsing.

A possible solution is to split the content into an array of lines and then filter that array, removing the lines that don't match your rule.

A sample would be:

// this is the content
$text = 'Title: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Snippet: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Category: Lorem ipsum dolor sit amet, consectetur adipiscing elit.';

// get the lines from your input as an array; you could achieve this in a different way if, for example, you are reading from a file
// \R matches any line break (\n, \r\n or \r), so the split does not depend on the platform's PHP_EOL
$lines = preg_split('/\R/', $text);

// apply a custom function to filter the lines (remove the ones that don't match your rule)
$results = array_filter($lines, 'test_content');

// show the results
echo '<pre>';
print_r($results);
echo '</pre>';

// custom function here:
function test_content($line)
{
    // case-insensitive search, notice stripos;
    // strict comparison (!==) so that a match at position 0, right at the start of the line, is not mistaken for false
    // lines for which this returns false will be removed
    return false !== stripos($line, 'Snippet');
}

That piece of code returns only one element in the $results array: the second line (under its original key 1, because array_filter preserves keys).

You can see it at work here: http://codepad.org/220BLjEk
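One caveat with the filter above is that `stripos` accepts "Snippet" anywhere in the line. If only lines that start with "Snippet:" should match, a variation using `preg_grep` might look like the sketch below. The shortened sample text, including the third line with "Snippet:" in the middle, is made up here to show the difference.

```php
<?php
// Hypothetical sample text with mixed line endings and a mid-line "Snippet:".
$text = "Title: Lorem ipsum\nsnippet: Dolor sit amet\r\nCategory: About a Snippet: here";

// \R matches any line-break sequence (\n, \r\n or \r).
$lines = preg_split('/\R/', $text);

// Keep only lines that begin with "Snippet:", ignoring case.
// preg_grep preserves the original array keys, like array_filter does.
$results = preg_grep('~^Snippet:~i', $lines);

print_r($results); // only key 1 remains: "snippet: Dolor sit amet"
```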

I'm going to give this a try and let you know how it goes. — Tower, May 07 '12 at 15:45
