I am still a beginner in PHP and write procedural-style code. I have not started using PHP libraries yet, so I want to extract the words of a page with plain core PHP rather than DOM and the like.
I have an HTML file with headings and paragraphs that looks like this:
<html>
<head>
<title>Title</title>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="keywords" content="HTML, CSS, JavaScript">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
<article>
<h>H: Heading H</h>
This is heading
<p>P: This is paragraph.</p>
<h>h2: Heading 2</h>
This is heading 2
<p>p2: This is paragraph 2.</p>
<h>h: Heading 3</h>
This is heading 3
<p>p: This is paragraph 3.</p>
</article>
</body>
</html>
This is very weird! Consider the following code:
<?php
$url = "http://localhost/Templates/one_page/bot/html_file.php";
$html = file_get_contents($url);

// ISSUE: Returns only one instance of the header.
function string_between_two_string($html, $starting_word, $ending_word)
{
    // Index of the first occurrence of the starting word
    $substring_start = strpos($html, $starting_word);
    // Adding the length of the starting word to its starting index
    // gives the index just past its end
    $substring_start += strlen($starting_word);
    // Length of the required substring
    $size = strpos($html, $ending_word, $substring_start) - $substring_start;
    // Return the substring of length $size starting at index $substring_start
    return substr($html, $substring_start, $size);
}

$heading = string_between_two_string($html, '<h>', '</h>');
echo $heading;
echo '<br>';
It echoes: H: Heading H
Now consider this code:
<?php
$url = "http://localhost/Templates/one_page/bot/html_file.php";
$html = file_get_contents($url);

// ISSUE: Returns all instances of the header plus the paragraphs.
function string_between_two_string($html, $starting_word, $ending_word)
{
    // Index of the first occurrence of the starting word
    $substring_start = strpos($html, $starting_word);
    // Adding the length of the starting word to its starting index
    // gives the index just past its end
    $substring_start += strlen($starting_word);
    // Length of the required substring
    $size = strpos($html, $ending_word, $substring_start) - $substring_start;
    // Return the substring of length $size starting at index $substring_start
    return substr($html, $substring_start, $size);
}

$paragraph = string_between_two_string($html, '<p>', '</p>');
echo $paragraph;
echo '<br>';
It echoes:
l>
H: Heading H This is heading
P: This is paragraph.
h2: Heading 2 This is heading 2
p2: This is paragraph 2.
h: Heading 3 This is heading 3
p: This is paragraph 3.
Why does "l>" get echoed?
NOTE: Both scripts are 99% the same. The only difference is that the first one is meant to echo the headers and the second one is meant to echo the paragraphs. My question is: if the first script echoes only the first header, shouldn't the second script also echo only the first paragraph rather than all of them? Conversely, if the second script is right to echo all of the paragraphs, shouldn't the first script also echo all the headers (H & h) instead of only the first one? This is puzzling and a mystery! What is happening here?
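One possible cause worth ruling out (an assumption, not something the output above confirms): strpos() returns false when the search string is not found, and false counts as 0 in arithmetic. The sketch below reproduces that case with a hypothetical inline page and a hypothetical '<h1>' needle that the page does not contain; neither is taken from the real file.

```php
<?php
// Hypothetical stand-in for the fetched page; it contains no "<h1>" tag.
$html = "<html><body><p>P: This is paragraph.</p></body></html>";

$start = strpos($html, '<h1>');   // needle is missing, so this is false
var_dump($start);                 // bool(false)

// false is treated as 0 here, so the start index lands inside "<html>".
$start += strlen('<h1>');         // 0 + 4 = 4

// The closing tag is missing too: false - 4 = -4.
$size = strpos($html, '</h1>', $start) - $start;

// A negative length makes substr() stop that many characters before the end,
// so nearly the whole page comes back instead of one element.
$out = substr($html, $start, $size);
echo $out;
```

With a four-character needle the result starts at index 4 of "<html>", that is, with "l>". A three-character needle such as '<p>' would start one character earlier, so comparing the needle with the raw page source (View Source rather than the rendered page) may show whether this is what happens.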
Anyway, which script should I build on to collect all the headers into one array ($headers = array()) and all the paragraphs into another array ($paragraphs = array())? And how do I write the code to achieve this?
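A minimal sketch of the kind of result being asked for, using only core string functions. The helper name all_strings_between() and the inline sample page are hypothetical; the idea is to call strpos() repeatedly with its third (offset) argument so each search resumes after the previous match, and to compare against false with !== so a missing tag ends the loop.

```php
<?php
// Hypothetical helper: collects every substring between $start_tag and $end_tag.
function all_strings_between($html, $start_tag, $end_tag)
{
    $matches = array();
    $offset = 0;

    // strpos() returns false when nothing is found, so compare strictly.
    while (($start = strpos($html, $start_tag, $offset)) !== false) {
        $start += strlen($start_tag);           // first character after the opening tag
        $end = strpos($html, $end_tag, $start); // next closing tag after that point
        if ($end === false) {
            break; // opening tag without a closing tag: stop searching
        }
        $matches[] = substr($html, $start, $end - $start);
        $offset = $end + strlen($end_tag);      // resume after this closing tag
    }

    return $matches;
}

// Inline sample instead of file_get_contents(), so the sketch runs on its own.
$html = "<h>H: Heading H</h>\n<p>P: This is paragraph.</p>\n"
      . "<h>h2: Heading 2</h>\n<p>p2: This is paragraph 2.</p>";

$headers = all_strings_between($html, '<h>', '</h>');
$paragraphs = all_strings_between($html, '<p>', '</p>');

print_r($headers);
print_r($paragraphs);
```

The match is literal and case-sensitive, so tags written as <P> or carrying attributes (for example <p class="x">) would be skipped by this sketch.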
I really need to know the answer to this mystery!
Thanks!