
I am trying to extract a specific JavaScript object from a page containing the usual HTML markup.

I have tried to use a regex, but I can't get it to match when the script block contains a line break.

An example can be seen here: https://regex101.com/r/b8zN8u/2

The HTML I am trying to extract from looks like this:

<script>
  DATA.tracking.user = { 
      age: "19", 
      name: "John doe" 
  }
</script>

I am using the following regex: DATA.tracking.user = (.*?)}

<?php
$re = '/DATA.tracking.user = (.*?)\}/m';
$str = '<script>
           DATA.tracking.user = { age: "19", name: "John doe" }
        </script>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

If I parse DATA.tracking.user = { age: "19", name: "John doe" } without any line breaks, it works fine, but it fails when I try to parse:

DATA.tracking.user = { 
      age: "19", 
      name: "John doe" 
  }

The pattern does not match across the line breaks.

Any help would be greatly appreciated.

Thanks.

Jrad51

4 Answers


Your pattern needs to allow for whitespace (\s, which also matches newline characters) in order to match JavaScript code that contains line breaks.

For example, the following code matches an object whose properties are spread over two lines:

<?php
$re = '/DATA.tracking.user = \{\s*.*\s*.*\s*\}/';
$str = '<script>
  DATA.tracking.user = {
      age: "19",
      name: "John doe"
  }
</script>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches[0]);
?>

You will get the following output:

Array
(
    [0] => DATA.tracking.user = {
      age: "19",
      name: "John doe"
  }
)
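
The pattern above has one .* per property line, so it stops matching once the object has a third property line. A minimal sketch of a variant that does not depend on the line count, assuming the object contains no nested braces and no "}" inside a string value: a negated character class such as [^}] matches newlines too. The city property and the variable names are made up for illustration.

```php
<?php
// Sketch: [^}]* matches any run of characters except "}", newlines included,
// so the number of property lines does not matter.
// Assumption: no nested braces and no "}" inside string values.
$re = '/DATA\.tracking\.user\s*=\s*\{([^}]*)\}/';

// Hypothetical input with a third property to show the line count is irrelevant.
$str = '<script>
  DATA.tracking.user = {
      age: "19",
      name: "John doe",
      city: "Leeds"
  }
</script>';

if (preg_match($re, $str, $m)) {
    // $m[1] holds the text between the braces
    echo trim($m[1]), "\n";
}
```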
pgngp

You need to add the 's' modifier to the end of your regex; otherwise, "." does not match newlines. From the PHP manual's list of pattern modifiers:

s (PCRE_DOTALL)

If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

So basically change your regex to be:

'/DATA.tracking.user = (.*?)\}/ms'

Also, you should escape your literal dots (otherwise you will also match strings such as "DATAYtrackingZuser"). So...

'/DATA\.tracking\.user = (.*?)\}/ms'

I'd also add in the open curly bracket and not enforce the single space around the equal sign, so:

'/DATA\.tracking\.user\s*=\s*\{(.*?)\}/ms'
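
As a quick check of the effect of the 's' modifier, here is a small sketch that runs the last pattern with and without it against the multi-line input. The variable names are only illustrative.

```php
<?php
$str = "DATA.tracking.user = {\n      age: \"19\",\n      name: \"John doe\"\n  }";

// Without "s", "." cannot match the newline after "{", so the lazy group never reaches "}".
$withoutS = preg_match('/DATA\.tracking\.user\s*=\s*\{(.*?)\}/m', $str, $m1);

// With "s", "." also matches newlines and the group spans the whole object body.
$withS = preg_match('/DATA\.tracking\.user\s*=\s*\{(.*?)\}/ms', $str, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
echo trim($m2[1]), "\n";
```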
wordragon
  • wordragon, thank you for your help with my issue. Although this worked and got my scraper running, I have accepted mickmackusa's answer because it made my life easier when working with the scraped key values. Thanks. – Jrad51 May 09 '18 at 07:37

The simple solution to your problem is to use the s pattern modifier to command the . (any character) to also match newline characters -- which it does not by default.

And you should:

  • escape your literal dots.
  • write the \{ outside of your capture group.
  • omit the m pattern modifier because you aren't using anchors.
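
The points above can be sketched as follows. This is only an illustration with made-up variable names; it compares what the capture group holds when the opening brace is inside or outside the group.

```php
<?php
$str = '<script>
  DATA.tracking.user = {
      age: "19",
      name: "John doe"
  }
</script>';

// Dots escaped, "s" added, "m" dropped (the pattern has no ^ or $ anchors).
// Here the "{" is still matched inside the capture group.
preg_match('/DATA\.tracking\.user = (.*?)\}/s', $str, $inside);

// Same pattern, but with the literal "{" written outside the capture group.
preg_match('/DATA\.tracking\.user = \{(.*?)\}/s', $str, $outside);

var_dump($inside[1][0] === '{');   // the first capture starts with the brace
echo trim($outside[1]), "\n";      // the second capture is the object body only
```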

...BUT...

If this was my task and I was going to be processing the data from the extracted string, I would probably start breaking up the components at extraction-time with the power of \G.

Code:

$htmls[] = <<<HTML
DATA.tracking.user = { age: "19", name: "John doe", int: 55 } // This works
HTML;

$htmls[] = <<<HTML
DATA.tracking.user = { 
    age: "20", 
    name: "Jane Doe",
    int: 49
} // This does not work
HTML;

foreach ($htmls as $html) {
    var_export(preg_match_all('~(?:\G(?!^),|DATA\.tracking\.user = \{)\s+([^:]+): (\d+|"[^"]*")~', $html, $out, PREG_SET_ORDER) ? $out : []);
    echo "\n --- \n";
}

Output:

array (
  0 => 
  array (
    0 => 'DATA.tracking.user = { age: "19"',
    1 => 'age',
    2 => '"19"',
  ),
  1 => 
  array (
    0 => ', name: "John doe"',
    1 => 'name',
    2 => '"John doe"',
  ),
  2 => 
  array (
    0 => ', int: 55',
    1 => 'int',
    2 => '55',
  ),
)
 --- 
array (
  0 => 
  array (
    0 => 'DATA.tracking.user = { 
    age: "20"',
    1 => 'age',
    2 => '"20"',
  ),
  1 => 
  array (
    0 => ', 
    name: "Jane Doe"',
    1 => 'name',
    2 => '"Jane Doe"',
  ),
  2 => 
  array (
    0 => ',
    int: 49',
    1 => 'int',
    2 => '49',
  ),
)
 --- 

Now you can simply iterate the matches and work with [1] (the keys) and [2] (the values). This is a basic solution that can be further tailored to suit your project data. Admittedly, it doesn't account for values that contain an escaped double quote; adding that feature would be no trouble. Accounting for more complex value types may be more of a challenge.
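
A minimal sketch of that iteration, assuming you want a flat key => value array with the surrounding double quotes stripped from string values. The $user array is an illustrative name and every value ends up as a PHP string.

```php
<?php
$html = <<<HTML
DATA.tracking.user = {
    age: "20",
    name: "Jane Doe",
    int: 49
}
HTML;

// Same pattern as above: \G continues from the end of the previous match.
$pattern = '~(?:\G(?!^),|DATA\.tracking\.user = \{)\s+([^:]+): (\d+|"[^"]*")~';

$user = [];
if (preg_match_all($pattern, $html, $out, PREG_SET_ORDER)) {
    foreach ($out as $match) {
        // $match[1] is the key, $match[2] the raw value (possibly double-quoted)
        $user[$match[1]] = trim($match[2], '"');
    }
}

var_export($user);
```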

mickmackusa
  • mickmackusa, thank you for taking the time to help out. I have accepted this answer because it best suited my requirements and made working with the $key and $value pairs much easier. – Jrad51 May 09 '18 at 07:35

Since you seem to be scraping/reading the page anyway (so you have a local copy), you can simply replace all the newline characters in the HTML page with spaces. Your existing pattern should then work without any changes.

Refer to this for the ascii values:
https://www.techonthenet.com/ascii/chart.php
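
A rough sketch of this idea, assuming the page source is already in a string. The $flat variable is an illustrative name; the pattern is the one from the question, left unchanged.

```php
<?php
$re = '/DATA.tracking.user = (.*?)\}/m';   // the pattern from the question, unchanged
$str = '<script>
  DATA.tracking.user = {
      age: "19",
      name: "John doe"
  }
</script>';

// Replace CR (ASCII 13) and LF (ASCII 10) with a space so "." can match through.
$flat = str_replace(["\r\n", "\r", "\n"], ' ', $str);

preg_match_all($re, $flat, $matches, PREG_SET_ORDER, 0);
print_r($matches);
```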

ron wizzle