I need to extract data from HTML, building arrays of values from each set of p tags. Here is a sample of the HTML:

<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 63px; white-space: nowrap;">Title </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 349px; white-space: nowrap;">1234 </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 461px; white-space: nowrap;">$30 </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 563px; white-space: nowrap;">$10,000,000 </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 777px; white-space: nowrap;">3,000,000 </p>

This HTML will repeat a number of times with the "Title" and "1234" labels staying the same, and then switching to different labels at a certain point. The "top" and "left" values will constantly change throughout the HTML. I have the ability to loop through existing "Title" and "1234" labels to match against that portion of things.

$title_label = 'Title';
$number_label = '1234';
preg_match_all('%\d{2}px; white-space: nowrap;">$title_label </p>%', $html_content, $array_match);
$array_cost_name = $array_match[1];
$array_return_name = $array_match[2];
$array_number_name = $array_match[3];

I would then need the 3 arrays to contain the last 3 label fields. In the case of the sample HTML provided I would want "$30", "$10,000,000" and "3,000,000" to be the first values of each array.

I am not sure how to go about writing a regular expression to handle this situation. Can anyone help?

user1110562
  • Consider using a DOM parser in PHP; it will be more efficient. https://www.php.net/manual/fr/domdocument.loadhtmlfile.php – Lounis Feb 20 '20 at 19:26

3 Answers

Regular expressions are not the right tool for this task; a DOM parser makes it much easier:

$html = '<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 63px; white-space: nowrap;">Title </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 349px; white-space: nowrap;">1234 </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 461px; white-space: nowrap;">$30 </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 563px; white-space: nowrap;">$10,000,000 </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 777px; white-space: nowrap;">3,000,000 </p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

$parts = $xml->xpath('//p[@class="ft01"]/text()'); // find all text nodes inside p tags with class ft01

// Indexes 0 and 1 hold the "Title" and "1234" labels; trim() drops the trailing space
$array_cost_name = trim((string) $parts[2]);
$array_return_name = trim((string) $parts[3]);
$array_number_name = trim((string) $parts[4]);

echo $array_cost_name, PHP_EOL; // $30
echo $array_return_name, PHP_EOL; // $10,000,000
echo $array_number_name, PHP_EOL; // 3,000,000
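
The snippet above reads a single row into three strings, while the question asks for three arrays filled across repeated rows. A minimal sketch of one way to do that follows. It assumes every row consists of exactly five p tags in document order, and the second row's values ($45 and so on) are invented for illustration. The style attributes are shortened for readability.

```php
<?php
// Hypothetical sample: two rows with the same labels; the second row's values are made up.
$html = '<p class="ft01" style="top: 103px; left: 63px;">Title </p>
<p class="ft01" style="top: 103px; left: 349px;">1234 </p>
<p class="ft01" style="top: 103px; left: 461px;">$30 </p>
<p class="ft01" style="top: 103px; left: 563px;">$10,000,000 </p>
<p class="ft01" style="top: 103px; left: 777px;">3,000,000 </p>
<p class="ft01" style="top: 121px; left: 63px;">Title </p>
<p class="ft01" style="top: 121px; left: 349px;">1234 </p>
<p class="ft01" style="top: 121px; left: 461px;">$45 </p>
<p class="ft01" style="top: 121px; left: 563px;">$2,500,000 </p>
<p class="ft01" style="top: 121px; left: 777px;">750,000 </p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect the trimmed text of every matching paragraph, in document order.
$texts = [];
foreach ($xpath->query('//p[@class="ft01"]') as $p) {
    $texts[] = trim($p->textContent);
}

$title_label = 'Title';
$number_label = '1234';
$array_cost_name = [];
$array_return_name = [];
$array_number_name = [];

// Assumption: each row is exactly five consecutive paragraphs.
foreach (array_chunk($texts, 5) as $row) {
    if (count($row) === 5 && $row[0] === $title_label && $row[1] === $number_label) {
        $array_cost_name[] = $row[2];
        $array_return_name[] = $row[3];
        $array_number_name[] = $row[4];
    }
}

print_r($array_cost_name);
print_r($array_return_name);
print_r($array_number_name);
```

If rows can vary in length, grouping paragraphs by the "top" value in the style attribute may be more robust than fixed chunks of five.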
Joffrey Schmitz

You could use a simple global regex such as /ace: nowrap;">(.*) <\/p>/ to capture the text of every p tag, and then remove the first two items so that only the last three remain. Here's an example and a link to test it.

$html_content = '<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 63px; white-space: nowrap;">Title </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 349px; white-space: nowrap;">1234 </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 461px; white-space: nowrap;">$30 </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 563px; white-space: nowrap;">$10,000,000 </p>
<p class="ft01" style="margin: 0; padding: 0; font-size: 16px; font-family: Times; color: #000000; position: absolute; top: 103px; left: 777px; white-space: nowrap;">3,000,000 </p>';

preg_match_all('/ace: nowrap;">(.*) <\/p>/', $html_content, $array_match);

$array_match = array_slice($array_match[1], 2); // index 1 holds the captured group; drop the two label entries

print_r($array_match);

http://sandbox.onlinephpfunctions.com/code/5ac69d44ff8168b4b21133c46dfa9c6db6986b6a
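
Slicing off the first two matches works for a single row. For repeated rows, one option is to anchor the pattern on the two labels and capture the three fields that follow. The sketch below takes that approach under some assumptions: the cell() helper is hypothetical, the second row's values are invented, and the style attributes are shortened. preg_quote() keeps any special characters in the labels from being read as regex syntax.

```php
<?php
// Hypothetical sample: two rows with the same labels; the second row's values are made up.
$html_content = '<p class="ft01" style="top: 103px; left: 63px;">Title </p>
<p class="ft01" style="top: 103px; left: 349px;">1234 </p>
<p class="ft01" style="top: 103px; left: 461px;">$30 </p>
<p class="ft01" style="top: 103px; left: 563px;">$10,000,000 </p>
<p class="ft01" style="top: 103px; left: 777px;">3,000,000 </p>
<p class="ft01" style="top: 121px; left: 63px;">Title </p>
<p class="ft01" style="top: 121px; left: 349px;">1234 </p>
<p class="ft01" style="top: 121px; left: 461px;">$45 </p>
<p class="ft01" style="top: 121px; left: 563px;">$2,500,000 </p>
<p class="ft01" style="top: 121px; left: 777px;">750,000 </p>';

$title_label = 'Title';
$number_label = '1234';

// Hypothetical helper: regex for one <p ...>inner</p> cell.
// [^>]* skips the style attribute, whose top/left values keep changing.
function cell(string $inner): string
{
    return '<p[^>]*>\s*' . $inner . '\s*</p>\s*';
}

$capture = '([^<]*?)'; // lazy, so surrounding whitespace stays outside the group

$pattern = '~'
    . cell(preg_quote($title_label, '~'))
    . cell(preg_quote($number_label, '~'))
    . cell($capture)
    . cell($capture)
    . cell($capture)
    . '~';

preg_match_all($pattern, $html_content, $array_match);

$array_cost_name = $array_match[1];
$array_return_name = $array_match[2];
$array_number_name = $array_match[3];

print_r($array_cost_name);
print_r($array_return_name);
print_r($array_number_name);
```

Note that the pattern string is built by concatenation: PHP does not interpolate variables inside single-quoted strings, so a label placed directly in a single-quoted pattern would be matched as literal text.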

Alexandre Elshobokshy

Through regex you can try this way:

\preg_match_all('/<p.*>(.*)<\/p>/', $html, $out);
$result = $out[1];

This captures the text between each pair of <p> and </p> tags, one match per line of HTML.
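
One caveat with a greedy pattern such as /<p.*>(.*)<\/p>/: it relies on each p tag sitting on its own line. If several tags share a line, the greedy quantifiers swallow everything up to the last tag. The sketch below compares it with a tighter variant that uses [^>]* and a lazy group. The single-line input is a hypothetical example, not taken from the question.

```php
<?php
// Hypothetical input: two paragraphs on the same line.
$html = '<p class="ft01">$30 </p><p class="ft01">$10,000,000 </p>';

// Greedy: ".*" runs to the last ">" that still allows a match,
// so only the final paragraph is captured.
preg_match_all('/<p.*>(.*)<\/p>/', $html, $greedy);

// Tighter: "[^>]*" stays inside the opening tag and "(.*?)" stops at the first </p>.
preg_match_all('/<p[^>]*>(.*?)<\/p>/', $html, $lazy);

$greedy_result = array_map('trim', $greedy[1]);
$lazy_result = array_map('trim', $lazy[1]);

print_r($greedy_result); // one entry: $10,000,000
print_r($lazy_result);   // two entries: $30 and $10,000,000
```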

Lounis