0

So I am trying to extract a JSON string from a big string, which is the source of a page. At the very end of the string there is `<script>json='[46801158,105847139,"N\/A"]'</script>` (the content varies). There are no other `<script>` tags. How can I extract just the JSON, i.e. only `[46801158,105847139,"N\/A"]`?

Thanks,

sophros
Aaron Jonk
  • You should look into string manipulation and especially the `split()` method. You could split your big string at `` and then take the left side. – Mathieu Feb 13 '19 at 14:15
  • Thanks @Mathieu! I think I can get there with your comment. – Aaron Jonk Feb 13 '19 at 14:17
  • If you're working with HTML, I highly suggest using an HTML parser and not regex as others may suggest. – Nordle Feb 13 '19 at 14:18
  • @MattB. I agree, but with the information provided it is not really possible to answer conclusively with an HTML parser. – Adam Dadvar Feb 13 '19 at 14:22
  • @MattB. To me your comment sounds a bit too dogmatic. Could you please link to a discussion on this issue (I know the comments are no place for a discussion)? After all, what folks suggest below does the job, right? (And in linear time and constant memory.) – sophros Feb 13 '19 at 14:26
  • @sophros Apologies, I'm not trying to sound as if my word is final. However, it is very well documented that you cannot reliably parse HTML with regex. Of course there are instances that work without issue, but there are so many edge cases that cannot be taken into account that the safest option is to use the tool for the job (an HTML parser). Here's a link on SO that highlights some issues - https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Nordle Feb 13 '19 at 14:32
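
For illustration, the parser-based approach raised in the comments could look roughly like the sketch below. It uses only the standard library's `html.parser`; the `ScriptCollector` class, the `page_source` variable, and the surrounding `<html><body>` markup are assumptions made up for the example, not something from the question.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Hypothetical helper: collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the current entry.
        if self.in_script:
            self.scripts[-1] += data


# Assumed sample input modelled on the question (raw string keeps the backslash).
page_source = r"""<html><body>blablabla<script>json='[46801158,105847139,"N\/A"]'</script></body></html>"""

collector = ScriptCollector()
collector.feed(page_source)
collector.close()

# The question says there is only one <script> tag, so take the last collected one.
script_text = collector.scripts[-1]
print(script_text)
```

This only isolates the script body; the `json='...'` wrapper still has to be removed with string manipulation or a regex.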

3 Answers

1

One way to do it:

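# Raw string, so the backslash in "\/" is kept literally without an invalid-escape warning.
big_string = r"""blablabla<script>json='[46801158,105847139,"N\/A"]'</script>blablabla"""

# Take the text between the script tags, then slice off the "json='" prefix and the closing quote.
# (str.strip("json='") removes a set of characters, not a prefix, so slicing is safer.)
script_body = big_string.split("<script>")[1].split("</script>")[0]
final = script_body[len("json='"):-1]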

Output:

'[46801158,105847139,"N\\/A"]'

This uses only basic string manipulation and relies on the page containing exactly one `<script>` tag. Other solutions exist.
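
If the goal is a Python list rather than the raw text, the extracted string can be passed to `json.loads`. The following is a minimal sketch under the assumption that the script body always has the exact form `json='...'`; the names `script_body`, `payload`, and `data` are just illustrative.

```python
import json

# Sample page source from the question (raw string so the backslash is kept as-is).
big_string = r"""blablabla<script>json='[46801158,105847139,"N\/A"]'</script>blablabla"""

# Isolate the script body, then cut off the "json='" prefix and the closing quote.
script_body = big_string.split("<script>")[1].split("</script>")[0]
payload = script_body[len("json='"):-1]

# json.loads decodes the escaped slash "\/" to a plain "/".
data = json.loads(payload)
print(data)  # [46801158, 105847139, 'N/A']
```

Note that `json.loads` raises `json.JSONDecodeError` if the extracted text is not valid JSON, which is a useful sanity check on the extraction.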

Mathieu
-1

You can match the whole `json='...'` assignment and use a capture group for its contents: `json='(.+)'`

A working example on regexr.

This would return [46801158,105847139,"N\/A"] in group #1.
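
In Python, that pattern could be applied roughly as follows. This is a sketch, assuming the assignment sits on a single line (by default `.` does not match newlines); `page_source` is a made-up variable name holding the page text.

```python
import re

# Assumed sample input modelled on the question (raw string keeps the backslash).
page_source = r"""blablabla<script>json='[46801158,105847139,"N\/A"]'</script>blablabla"""

# Group 1 captures everything between the single quotes after "json=".
match = re.search(r"json='(.+)'", page_source)
if match:
    print(match.group(1))
else:
    print("no json assignment found")
```

Because `.+` is greedy, the match runs to the last single quote on the line, which is fine here since the JSON itself contains no single quotes.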

Jim Wright
-2

You could use regex:

>>> from re import findall
>>> findall(r"<script>json='(.+)'</script>", r"""<script>json='[46801158,105847139,"N\/A"]'</script>""")
['[46801158,105847139,"N\\/A"]']

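This uses the regex `<script>json='(.+)'</script>`, which matches the script tags and captures the contents of the single-quoted string assigned to `json`. `findall` returns a list of the captured groups, so the JSON text is the first element.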

Adam Dadvar