
I have scraped a full HTML page that contains a lot of markup, including HTML/CSS/JS code.

Example below (content stripped):

<p>blah blah blah html</p>
<script type="text/javascript">window._userData ={"country_code": "PK", "language_code": "en",user:[{"user": {"username": "johndoe", "follows":12,"biography":"blah blah blah","feedback_score":99}}],"another_var":"another value"} </script>
<script> //multiple script tags can be here... </script>
<p>blah blah blah html</p>

Now I want to extract the object assigned to window._userData and then, if possible, convert the extracted string into a PHP object/array.

I have tried a few regular expressions found on SO but couldn't get them working.

I have also tried the similar answer in "Regular expression extract a JavaScript variable in PHP".

Thanks

Alyas
  • the object you want to extract is incorrect. – splash58 Jun 13 '16 at 10:28
  • @splash58 I have added the missing }. Thanks for the comment; any solution, please? – Alyas Jun 13 '16 at 10:30
  • moreover, it cannot contain spaces and must have all keys in quotes - `{"country_code":"PK","language_code":"en","user":[{"user":{"username": "johndoe","follows":12,"biography":"blah blah blah","feedback_score":99}}],"another_var":"another value"}` – splash58 Jun 13 '16 at 10:33
  • `/ – YOU Jun 13 '16 at 10:36

1 Answer


Find it with a regex:

preg_match('/\bwindow\._userData\s*=(.+)(?=;|<\/script)/', $html, $m);

and decode it:

json_decode(trim($m[1]), true);

But first you need to make sure the JSON in that HTML is valid.
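Putting the two steps together, here is a minimal sketch rather than a definitive implementation. The function name `extractUserData` and the sample `$html` are hypothetical, and the pattern is a variant of the one above: it matches lazily up to the first closing `</script>` and uses the `s` flag so the object may span several lines. It assumes the string `</script>` never appears inside the JSON itself.

```php
<?php
/**
 * Hypothetical helper: pull the object assigned to window._userData out of
 * an HTML string and decode it. Returns null when nothing usable is found.
 */
function extractUserData(string $html): ?array
{
    // Lazy (.+?) stops at the first </script>; the "s" flag lets "." span newlines.
    // An optional trailing ";" is left out of the capture.
    $pattern = '/\bwindow\._userData\s*=\s*(.+?)\s*;?\s*<\/script>/s';
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }

    // json_decode() returns null on invalid JSON (for example an unquoted key).
    $data = json_decode($m[1], true);

    return is_array($data) ? $data : null;
}

// Made-up sample input with all keys quoted, so it is valid JSON.
$html = '<p>blah</p>'
    . '<script type="text/javascript">window._userData ={"country_code": "PK",'
    . ' "user":[{"user": {"username": "johndoe", "follows":12}}]} </script>'
    . '<script>var other = 1;</script>';

$data = extractUserData($html);
echo $data['user'][0]['user']['username'], "\n"; // johndoe
```

If the key `user` is left unquoted, as in the question's sample, `json_decode()` fails and this sketch returns null, which is why the JSON has to be valid first.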

splash58
  • This is the right answer, but you will still have problems when the script tag contains more than one JS object and/or the object contains strings with `;`. If you can rule that out, it will work. Edit: JS is not a regular language, therefore [this answer applies](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Johannes Stadler Jun 13 '16 at 10:43
  • @JohannesStadler if the JSON contains `;` or EOL, it's really a problem; I don't know how to solve it – splash58 Jun 13 '16 at 10:48
  • I think it's not possible with regex. JS is not a regular language, so regex has its limits. – Johannes Stadler Jun 13 '16 at 10:50
  • @JohannesStadler You are right. Unfortunately, I don't know any library to parse JS but JS itself :). – splash58 Jun 13 '16 at 10:53
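
One possible way around the `;` problem raised in these comments, offered as a rough sketch and not as a JavaScript parser: scan from the first `{` after the variable name and count braces, skipping anything inside double-quoted strings. The function name `extractBalancedObject` and the sample input are hypothetical. The sketch assumes the value is a JSON-style object literal (double-quoted strings only, no comments or regex literals) and that the first `{` after the marker opens that object.

```php
<?php
/**
 * Hypothetical helper: return the balanced {...} literal that follows
 * $marker in $source, or null if none is found.
 */
function extractBalancedObject(string $source, string $marker): ?string
{
    $pos = strpos($source, $marker);
    if ($pos === false) {
        return null;
    }
    $start = strpos($source, '{', $pos + strlen($marker));
    if ($start === false) {
        return null;
    }

    $depth = 0;        // current brace nesting level
    $inString = false; // inside a double-quoted string?
    $escaped = false;  // previous character was a backslash inside a string?
    $len = strlen($source);

    for ($i = $start; $i < $len; $i++) {
        $ch = $source[$i];
        if ($inString) {
            if ($escaped) {
                $escaped = false;
            } elseif ($ch === '\\') {
                $escaped = true;
            } elseif ($ch === '"') {
                $inString = false;
            }
            continue; // braces and semicolons inside strings are ignored
        }
        if ($ch === '"') {
            $inString = true;
        } elseif ($ch === '{') {
            $depth++;
        } elseif ($ch === '}' && --$depth === 0) {
            return substr($source, $start, $i - $start + 1);
        }
    }

    return null; // braces never balanced
}

// Made-up input: the string value contains ";", "}" and an escaped quote.
$html = '<script>window._userData = {"bio":"a; b } c \" d","n":{"k":1}}; var x = {};</script>';
$json = extractBalancedObject($html, 'window._userData');
$data = json_decode($json, true);
echo $data['bio'], "\n"; // a; b } c " d
```

The extracted string still has to be valid JSON for `json_decode()` to accept it, so the point about quoting every key applies here as well.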