
I am still a beginner at PHP and write in a procedural style. I have not started using PHP libraries yet, so I want to use plain core PHP to extract words from a page rather than DOM and the like.

I have an HTML file that looks like the following, with headings and paragraphs:

<html>
<head>
<title>Title</title>
  <meta charset="UTF-8">
  <meta name="description" content="Free Web tutorials">
  <meta name="keywords" content="HTML, CSS, JavaScript">
  <meta name="author" content="John Doe">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
<article>


<h>H: Heading H</h>
This is heading 
<p>P: This is paragraph.</p>

<h>h2: Heading 2</h>
This is heading 2
<p>p2: This is paragraph 2.</p>

<h>h: Heading 3</h>
This is heading 3
<p>p: This is paragraph 3.</p>
</article>
</body>
</html>

This is very weird! Consider the following code:

<?php

$url = "http://localhost/Templates/one_page/bot/html_file.php";
$html = file_get_contents($url);


// ISSUE: Returns only the first heading.

function string_between_two_string($html, $starting_word, $ending_word)
{
    $substring_start = strpos($html, $starting_word);
    // Adding the length of the starting word to its starting index
    // gives the index where the wanted substring begins
    $substring_start += strlen($starting_word);
    // Length of the required substring
    $size = strpos($html, $ending_word, $substring_start) - $substring_start;
    // Return the substring of length $size starting at $substring_start
    return substr($html, $substring_start, $size);
}
  
$heading = string_between_two_string($html, '<h>', '</h>'); 
echo $heading; 
echo '<br>';

It echoes: H: Heading H

Now consider this code:

<?php

$url = "http://localhost/Templates/one_page/bot/html_file.php";
$html = file_get_contents($url);

// ISSUE: Returns all the headings plus the paragraphs.

function string_between_two_string($html, $starting_word, $ending_word)
{
    $substring_start = strpos($html, $starting_word);
    // Adding the length of the starting word to its starting index
    // gives the index where the wanted substring begins
    $substring_start += strlen($starting_word);
    // Length of the required substring
    $size = strpos($html, $ending_word, $substring_start) - $substring_start;
    // Return the substring of length $size starting at $substring_start
    return substr($html, $substring_start, $size);
}
  
$paragraph = string_between_two_string($html, '<p>', '</p>'); 
echo $paragraph; 
echo '<br>';

It echoes:

l>
H: Heading H This is heading
P: This is paragraph.
h2: Heading 2 This is heading 2
p2: This is paragraph 2.
h: Heading 3 This is heading 3
p: This is paragraph 3.

Why does "l>" get echoed?

NOTE: Both scripts are 99% the same. The only difference is that the first is designed to echo the headings and the second to echo the paragraphs. My question is: if the first script echoes only the first heading, shouldn't the second script also echo only the first paragraph rather than all of them? Conversely, if the second script is right to echo all the paragraphs, shouldn't the first script also echo all the headings (H and h) instead of only the first one? This is puzzling. What is happening here?
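One possible explanation, which nothing in this thread confirms: if the fetched page does not contain the exact tags being searched for (different casing, attributes, or escaped markup, for example), `strpos()` returns `false` for both the start and the end tag. The arithmetic in the function then treats `false` as 0, so `substr()` is called with a start of `strlen($starting_word)` and a negative length, and it returns almost the whole document starting partway through `<html>`. In a browser that could show up as a stray fragment of the opening tag followed by every heading and paragraph. The sketch below reproduces the effect on a small inline string; the `$sample` markup with uppercase `<P>` tags is made up for the demonstration.

```php
<?php
// Same logic as the function in the question, with shorter variable names.
function string_between_two_string($html, $starting_word, $ending_word)
{
    $start = strpos($html, $starting_word);    // false when the tag is missing
    $start += strlen($starting_word);          // false is treated as 0 here
    $size = strpos($html, $ending_word, $start) - $start;   // false - $start is negative
    return substr($html, $start, $size);
}

// Hypothetical sample: the paragraph tags are uppercase, so '<p>' and '</p>' never match.
$sample = '<html><body><h>Heading</h><P>Paragraph</P></body></html>';

$heading = string_between_two_string($sample, '<h>', '</h>');
$paragraph = string_between_two_string($sample, '<p>', '</p>');

echo $heading, "\n";     // the tags are found, so only the first heading comes back
echo $paragraph, "\n";   // the tags are not found, so nearly the whole string comes back
```

If this is what is happening, checking the `strpos()` results with `=== false` before doing arithmetic on them would make the failure visible instead of silently returning the wrong slice.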

Anyway, which script should I work on to have all the headings collected into one array ($headers = array()) and all the paragraphs collected into another array ($paragraphs = array())? And how do I write the lines to achieve this?
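A minimal sketch of one way to do this in plain procedural PHP: call `strpos()` in a loop, passing the position just past the previous match as the offset, and push each match onto an array. The helper name `all_strings_between` and the inline `$html` sample are made up for illustration. It assumes the tags appear exactly as written (no attributes, same casing), so it is fragile on real-world pages.

```php
<?php
// Collect every substring found between $start_tag and $end_tag.
// Hypothetical helper; it is not part of the original scripts.
function all_strings_between($html, $start_tag, $end_tag)
{
    $results = array();
    $offset = 0;

    // Keep searching from just past the previous match until no start tag is left.
    while (($start = strpos($html, $start_tag, $offset)) !== false) {
        $start += strlen($start_tag);
        $end = strpos($html, $end_tag, $start);
        if ($end === false) {
            break;   // unmatched start tag: stop instead of returning garbage
        }
        $results[] = substr($html, $start, $end - $start);
        $offset = $end + strlen($end_tag);
    }

    return $results;
}

// Shortened stand-in for the page in the question.
$html = '<article><h>H: Heading H</h><p>P: This is paragraph.</p>'
      . '<h>h2: Heading 2</h><p>p2: This is paragraph 2.</p></article>';

$headers = all_strings_between($html, '<h>', '</h>');
$paragraphs = all_strings_between($html, '<p>', '</p>');

print_r($headers);
print_r($paragraphs);
```

The explicit `!== false` checks matter here: `strpos()` can legitimately return 0, and without them a missing tag would be treated as position 0.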

I really need to know the answer to this mystery!

Thanks!

  • Why not parse the HTML properly using an HTML parser, instead of trying to perform string manipulation on it...? There are [way too many edge cases to parse HTML or other XML-style markup with string manipulation or RegExp functionality](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – esqew Dec 14 '21 at 17:54
  • I don't see how that script could ever return more than one element per call (or return different strings for any subsequent calls unless you change `$html` and/or the tag each time). Also, `<h>` isn't a valid HTML tag. The HTML tags for headings (not headers) are `<h1>` to `<h6>`. – M. Eriksson Dec 14 '21 at 17:55
  • Can you show a sample of the file? – Barmar Dec 14 '21 at 18:00
  • @Barmar, Sorry! I forgot to add the html code initially. I have now edited my post and added the html code at the beginning of my post. – studentprogrammer2020 Dec 14 '21 at 18:05
  • I can't reproduce the problem: https://ideone.com/cRnwYj – Barmar Dec 14 '21 at 18:07
  • @M. Erikson, Thanks for correcting my header mistakes. But can you please explain why the other script echoes more than one header (as well as more than one paragraph) when the first script doesn't echo more than one header? I need to get out of my confusion. – studentprogrammer2020 Dec 14 '21 at 18:07
  • @Barmar, Did you try both scripts ? – studentprogrammer2020 Dec 14 '21 at 18:10
  • No, just the one that's not working. – Barmar Dec 14 '21 at 18:10
  • I've updated it to do both. They both work as expected. – Barmar Dec 14 '21 at 18:11
  • @esqew, Can you kindly show me how to extract the following using simple_html_dom: headers (h1-h6), meta tags, paragraphs. The first two are most important. - Thanks! – studentprogrammer2020 Dec 14 '21 at 18:12
  • @studentprogrammer2020 There are plenty of resources that already exist on this site (and elsewhere on the broader Internet) on how to accomplish this. If you have a specific part of something like `simple_html_dom` that you can't get working in the context of your specific use case, edit your question or ask a new one being sure to include the code you've already written as a [mre] along with an explanation of where *specifically* you're getting stuck in accordance with [ask]. – esqew Dec 14 '21 at 18:18
