
I have a large JS dictionary retrieved from an HTML web page, and I want to extract data from it without parsing the JavaScript. Currently I am trying to do this with a regular expression.

The problem is that the dictionary is quite complex and dynamic: new keys may occasionally be inserted, but I expect my target keys to stay the same.

This is heavily trimmed data with some values omitted, but it preserves the complexity.

{"compactVideoRenderer":{"videoId":"abcDE123-_","thumbnail":{"thumbnails":[{"url":"OMMITED_URL","width":168,"height":94},{"url":"OMMITED_URL_TWO","width":336,"height":188}]},"title":{"accessibility":{"accessibilityData":{"label":"OMMITED_TITLE"}},"simpleText":"OMMITED_TITLE_SIMPLE"}}}

From the above, I need to extract the values of the following:

  • compactVideoRenderer -> videoId ("abcDE123-_")
  • compactVideoRenderer -> title -> simpleText ("OMMITED_TITLE_SIMPLE")

The solution must be flexible enough that if another key-value pair is inserted at any location (as long as it does not change the 'address' of the target keys), the regex can still find the target values.

Since regex is largely language-independent, code in any language will help, but code or suggestions in Python are especially welcome!

brikas

2 Answers


Use https://pypi.org/project/jsonfinder/ to extract the JSON object from the HTML string. Then you can work with a normal Python dict. No regex needed.
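If installing a third-party package is not an option, a similar effect can be sketched with the standard library alone. This is not jsonfinder itself but a swapped-in technique: `json.JSONDecoder.raw_decode` is tried at each `{` until an object containing the wanted top-level key decodes. The helper name `first_json_object`, the variable `pageData`, and the sample HTML are hypothetical, and the sketch assumes the embedded dictionary is strict JSON.

```python
import json


def first_json_object(text, required_key):
    """Return the first JSON object embedded in `text` that has
    `required_key` at its top level, or None if there is none."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and
            # returns it together with the index where parsing stopped.
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON here; try the next opening brace.
            pos = text.find("{", pos + 1)
            continue
        if isinstance(obj, dict) and required_key in obj:
            return obj
        # Valid JSON but not the object we want; continue after it.
        pos = text.find("{", end)
    return None


# Hypothetical page fragment built from the trimmed sample data.
html = (
    '<script>var pageData = {"compactVideoRenderer":{"videoId":"abcDE123-_",'
    '"title":{"simpleText":"OMMITED_TITLE_SIMPLE"}}};</script>'
)

data = first_json_object(html, "compactVideoRenderer")
renderer = data["compactVideoRenderer"]
print(renderer["videoId"])
print(renderer["title"]["simpleText"])
```

Because the result is a normal dict, keys inserted elsewhere in the structure do not affect the lookups.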

Alex Hall

Why use regex when you can access the elements the natural way?

If you must use regex, there are duplicates, e.g. "Python - Parsing JSON formatted text file with regex".
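For reference, a minimal regex sketch in Python might look like the following. It assumes each target key is unique in the text and that its value is a JSON string; it does not check the 'address' of the key, so it returns the first `"key":"value"` pair found anywhere. The helper name `find_string_value` is hypothetical.

```python
import json
import re

# Body of a JSON string literal: any run of non-quote, non-backslash
# characters or backslash escapes, captured without the outer quotes.
STRING_VALUE = r'"((?:[^"\\]|\\.)*)"'


def find_string_value(text, key):
    """Return the string value of the first `"key": "..."` pair in text,
    or None if the key does not occur with a string value."""
    pattern = re.compile(r'"' + re.escape(key) + r'"\s*:\s*' + STRING_VALUE)
    match = pattern.search(text)
    if match is None:
        return None
    # Let the json module decode escapes such as \" or \u00e9.
    return json.loads('"' + match.group(1) + '"')


text = (
    '{"compactVideoRenderer":{"videoId":"abcDE123-_","thumbnail":{"thumbnails":'
    '[{"url":"OMMITED_URL","width":168,"height":94}]},"title":{"accessibility":'
    '{"accessibilityData":{"label":"OMMITED_TITLE"}},'
    '"simpleText":"OMMITED_TITLE_SIMPLE"}}}'
)

video_id = find_string_value(text, "videoId")
simple_text = find_string_value(text, "simpleText")
print(video_id)
print(simple_text)
```

Inserting new key-value pairs does not break this, but a second `videoId` or `simpleText` earlier in the text would be matched instead, which is the main reason parsing is usually safer.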

In Python 3 you can do:

import json
from types import SimpleNamespace

# `data` is the JSON text as a str.
# Parse JSON into an object with attributes corresponding to dict keys.
x = json.loads(data, object_hook=lambda d: SimpleNamespace(**d))
print(x.compactVideoRenderer.videoId)
print(x.compactVideoRenderer.title.simpleText)
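If the keys might be missing, or are not valid Python attribute names, a plain dict with a small path helper is an alternative. The sketch below uses a hypothetical helper `get_path` and a sample string with extra keys inserted, to show that new keys do not disturb the lookup as long as the target 'address' is unchanged.

```python
import json


def get_path(obj, *keys, default=None):
    """Follow `keys` through nested dicts; return `default` if any
    step is missing or is not a dict."""
    current = obj
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Sample data with extra keys ("newKey", "extra") inserted around the targets.
raw = (
    '{"compactVideoRenderer":{"videoId":"abcDE123-_","newKey":1,'
    '"title":{"extra":{},"simpleText":"OMMITED_TITLE_SIMPLE"}}}'
)

parsed = json.loads(raw)
video_id = get_path(parsed, "compactVideoRenderer", "videoId")
title = get_path(parsed, "compactVideoRenderer", "title", "simpleText")
missing = get_path(parsed, "compactVideoRenderer", "nope", "simpleText")
print(video_id, title, missing)
```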

In JS:

const data = JSON.parse(`{
  "compactVideoRenderer": {
    "videoId": "abcDE123-_",
    "thumbnail": {
      "thumbnails": [{
        "url": "OMMITED_URL",
        "width": 168,
        "height": 94
      }, {
        "url": "OMMITED_URL_TWO",
        "width": 336,
        "height": 188
      }]
    },
    "title": {
      "accessibility": {
        "accessibilityData": {
          "label": "OMMITED_TITLE"
        }
      },
      "simpleText": "OMMITED_TITLE_SIMPLE"
    }
  }
}`)

console.log(data.compactVideoRenderer.videoId)
console.log(data.compactVideoRenderer.title.simpleText)
mplungjan
  • OP needs Python – Pac0 Nov 08 '20 at 17:37
  • I imagine that using regex is much lighter, performance- and memory-wise, than rebuilding a huge dictionary in Python and then accessing just a few values. – brikas Nov 08 '20 at 17:39
  • _code in any language will help, however, code or suggestions in Python are extra helpful!_ but see update – mplungjan Nov 08 '20 at 17:39
  • If it was on a web page to begin with, I assume it does not have 100,000 elements – mplungjan Nov 08 '20 at 17:40
  • As an example, one JS script in the page has 300,000 characters. I will need to retrieve my target values from as many as 100 sources per request. I am running this on a light server, and waiting time matters a lot for user experience. It's one of the main reasons I am not doing this in Selenium. You may be right that the difference is small; I will have to test the performance of the method you suggested. – brikas Nov 08 '20 at 17:46
  • Anyway, your answer is quite valuable, despite not using regex. I may look into it and try it this way. Thank you! – brikas Nov 08 '20 at 17:47
  • I added an answer for regex as a dupe. There are many more if you search for python regex json – mplungjan Nov 08 '20 at 17:48