Simple PHP code for extracting data from the HTML source code

Question

I know I could use XPath, but in this case it wouldn't work because of how complex the site's navigation is.

I can only use the source code.

I have searched all over the place and couldn't find a simple PHP solution that would:

Open the HTML source code page (I already have an exact source code page URL).
Select and extract the text between two known strings in the source. Not the contents of a div; I know the exact start and end markers.

So, basically, I need to extract the text between

knownhtmlcodestart> Text to extract <knownhtmlcodeend

What I'm trying to achieve in the end is this:

Go to a source code URL.
Extract the text between two codes.
Store the data temporarily in a simple text file on my web server (I set manually how long it is kept).
Wait for a defined time, then repeat the whole process.

The website I'm going to extract data from changes dynamically, so each run would store new data in the same file.

Then I would use that data (but that's a question for another time).

I would appreciate it if anyone could lead me to a simple solution.

I'm not asking anyone to write the code for me, but if someone has done something similar, sharing that code here would be helpful.

Thanks

You can use a regex to extract the text as mentioned above. There is a regex solution at https://stackoverflow.com/questions/2403122/regular-expression-to-extract-text-between-square-brackets if you are interested. — Kishen Nagaraju, Mar 24 '21 at 18:43
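A minimal sketch of the regex approach suggested in the comment, assuming both markers are literal strings. The function name `extractBetweenRegex` and the sample input are made up for illustration; `preg_quote()` escapes the markers so they are matched literally, and the `s` modifier lets the match span line breaks.

```php
<?php
// Hypothetical helper: returns the trimmed text between two literal markers,
// or null if the markers are not found.
function extractBetweenRegex(string $html, string $start, string $end): ?string
{
    // '~' is the pattern delimiter; (.*?) is non-greedy so the match stops
    // at the first end marker after the start marker.
    $pattern = '~' . preg_quote($start, '~') . '(.*?)' . preg_quote($end, '~') . '~s';

    if (preg_match($pattern, $html, $matches) === 1) {
        return trim($matches[1]);
    }
    return null;
}

// Sample input built from the markers in the question.
$sample = "<p>intro</p>\nknownhtmlcodestart> Text to extract <knownhtmlcodeend\n<p>outro</p>";
echo extractBetweenRegex($sample, 'knownhtmlcodestart>', '<knownhtmlcodeend'), "\n";
```

For very large pages a plain string search may be simpler and faster than a regex, but for a short known pair of markers this is usually enough.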

score 1 · Answer 1 · answered Mar 24 '21 at 19:06

I (shamefully) found the following function useful for extracting stuff from HTML. Regexes can get too complex when you need to extract something large, e.g. a whole <table>.

/*
   $str    - the string (e.g. the HTML source) to search in
   $start  - string marking the start of the sequence you want to extract
   $end    - string marking the end of it; null means "up to the end of $str"
   $offset - starting position in case you need to find multiple occurrences
   returns false if `$start` or `$end` is not found; otherwise the string
   between `$start` and `$end`, and the indexes of start and end
*/
function strExt($str, $start, $end = null, $offset = 0)
{
    $p1 = mb_strpos($str, $start, $offset);
    if ($p1 === false) return false;
    $p1 += mb_strlen($start);

    // Search from $p1 (not $p1 + 1) so that an empty match is still found.
    $p2 = $end === null ? mb_strlen($str) : mb_strpos($str, $end, $p1);
    if ($p2 === false) return false;

    return [
        'str'   => mb_substr($str, $p1, $p2 - $p1),
        'start' => $p1,
        'end'   => $p2,
    ];
}
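The `$offset` parameter is there for finding multiple occurrences. As a rough, self-contained sketch of that idea (the name `extractAll` and the sample HTML are illustrative, not part of the answer), a loop can keep advancing the offset past each match:

```php
<?php
// Hypothetical helper: collects every substring found between $start and $end.
function extractAll(string $str, string $start, string $end): array
{
    $found = [];
    $offset = 0;

    while (($p1 = mb_strpos($str, $start, $offset)) !== false) {
        $p1 += mb_strlen($start);          // first character after the start marker
        $p2 = mb_strpos($str, $end, $p1);  // next end marker at or after that point
        if ($p2 === false) {
            break;                         // start marker without an end marker: stop
        }
        $found[] = mb_substr($str, $p1, $p2 - $p1);
        $offset = $p2 + mb_strlen($end);   // continue searching after the end marker
    }
    return $found;
}

// Sample input (made up): three table cells, the last one empty.
$cells = extractAll('<td>alpha</td><td>beta</td><td></td>', '<td>', '</td>');
print_r($cells);
```

Like the function above, this relies on the mbstring extension being available.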

score 1 · Answer 2 · answered Mar 24 '21 at 19:11

This assumes the opening and closing markers are on the same line (as in your example). If they can be on separate lines, it wouldn't be difficult to adapt this.

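// file_get_contents() needs a full URL including the scheme (placeholder URL).
$html = file_get_contents('https://website.com');

$start = "knownhtmlcodestart>";
$end   = "<knownhtmlcodeend";
$text  = null;

$lines = explode("\n", $html);

foreach ($lines as $line) {
    $t1 = strpos($line, $start);
    if ($t1 === false)    // compare with false: a match at position 0 is valid
        continue;

    $c1 = $t1 + strlen($start);      // first character after the start marker
    $c2 = strpos($line, $end, $c1);  // end marker located after the start marker

    if ($c2 !== false) {
        $text = substr($line, $c1, $c2 - $c1);
        break;
    }
}

echo $text;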
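For the "store temporarily, then repeat" part of the question, one possible sketch is a small file cache with a time-to-live. Everything here is an assumption for illustration: the function name `cachedExtract`, the cache file location, and the stand-in fetcher. A real fetcher would download the page and extract the text with one of the approaches above.

```php
<?php
// Hypothetical helper: returns the cached text while the cache file is
// younger than $ttl seconds; otherwise calls $fetch, overwrites the file
// with the new text, and returns it.
function cachedExtract(string $cacheFile, int $ttl, callable $fetch): ?string
{
    clearstatcache(true, $cacheFile); // make sure filemtime() is not stale

    $hasCache = is_file($cacheFile);
    if ($hasCache && (time() - filemtime($cacheFile)) < $ttl) {
        return (string) file_get_contents($cacheFile);
    }

    $text = $fetch(); // expected to return the extracted string, or null on failure
    if ($text === null) {
        // Fetch failed: fall back to the old data if there is any.
        return $hasCache ? (string) file_get_contents($cacheFile) : null;
    }

    // LOCK_EX keeps two overlapping runs from writing the file at the same time.
    file_put_contents($cacheFile, $text, LOCK_EX);
    return $text;
}

// Demo with a stand-in fetcher that returns a fixed string.
$cacheFile = sys_get_temp_dir() . '/extracted_demo.txt'; // assumed location
$fetch = function () {
    return 'Text to extract';
};
echo cachedExtract($cacheFile, 300, $fetch), "\n";
```

With this shape the script can be run on every page request, or from a scheduler such as cron, and the site is only fetched again once the stored data is older than the chosen interval.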
