Python search for a string in a string and get what's behind that string

Question

I am trying to extract a JSON string from a big string, which is the source of a page. At the very end of the string there is <script>json='[46801158,105847139,"N\/A"]'</script> (the contents vary). There are no other <script> tags. How can I get just the JSON, i.e. only [46801158,105847139,"N\/A"]?

Thanks,

You should look into string manipulation and especially the `split()` method. You could split your big string at `` and then take the left side. — Mathieu, Feb 13 '19 at 14:15
Thanks @Mathieu! I think I can get there with your comment. — Aaron Jonk, Feb 13 '19 at 14:17
If you're working with HTML, I highly suggest using an HTML parser and not regex as others may suggest. — Nordle, Feb 13 '19 at 14:18
@MattB. I agree, but with the information provided it is not really possible to answer conclusively with an HTML parser. — Adam Dadvar, Feb 13 '19 at 14:22
@MattB. To me your comment sounds a bit too dogmatic. Could you please link to a discussion on this issue (I know the comments are no place for a discussion)? After all, what folks suggest below does the job, right? (And in linear time and constant memory.) — sophros, Feb 13 '19 at 14:26
@sophros Apologies, I'm not trying to sound as if my word is final. However, it is very well documented that you cannot parse HTML with regex. Of course there are instances that work without issue, but there are so many edge cases that cannot be taken into account that the safest option is to use the right tool for the job (an HTML parser). Here's a link on SO that highlights some issues: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Nordle, Feb 13 '19 at 14:32
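For reference, the parser-based approach suggested in the comments could look roughly like the sketch below. It uses only the standard library's `html.parser`; the class name `ScriptCollector` and the sample `page_source` are made up for illustration, and the code assumes the script body always has the form `json='...'` as in the question.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the raw text found inside <script> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in more than one chunk, so accumulate it.
        if self.in_script:
            self.scripts[-1] += data


# Sample input modelled on the question; a raw string keeps the backslash as-is.
page_source = r"""<html><body>blablabla<script>json='[46801158,105847139,"N\/A"]'</script></body></html>"""

collector = ScriptCollector()
collector.feed(page_source)
collector.close()

script_text = collector.scripts[0]

# Assumption: the script body is exactly json='...'.
prefix = "json='"
json_text = None
if script_text.startswith(prefix) and script_text.endswith("'"):
    json_text = script_text[len(prefix):-1]

print(json_text)
```

This is more code than a one-line split, but it does not depend on how the surrounding markup is written.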

score 1 · Answer 1 · answered Feb 13 '19 at 14:18

One way to do it:

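big_string = r"""blablabla<script>json='[46801158,105847139,"N\/A"]'</script>blablabla"""

# Take the text between the script tags, then slice off the leading json=' and the trailing quote.
# (str.strip("json='") would remove a set of characters, not a prefix, so slicing is safer.)
script_body = big_string.split("<script>")[1].split("</script>")[0]
final = script_body[len("json='"):-1]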

Output:

'[46801158,105847139,"N\\/A"]'

This is only using basic string manipulation. Other solutions exist.
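As a possible follow-up, the extracted text can be turned into a Python list with the standard `json` module. The sketch below is one way to do it, assuming the page always contains the exact markers `<script>json='` and `'</script>`; it uses `str.partition` instead of `split`, and the variable names are only illustrative.

```python
import json

# Sample input modelled on the question; a raw string keeps the backslash as-is.
big_string = r"""blablabla<script>json='[46801158,105847139,"N\/A"]'</script>blablabla"""

# partition() splits at the first occurrence and returns (before, separator, after).
_, _, after_open = big_string.partition("<script>json='")
json_text, _, _ = after_open.partition("'</script>")

# "\/" is a valid JSON escape for "/", so the decoded string is "N/A".
values = json.loads(json_text)
print(values)
```

If either marker is missing, `partition` returns empty strings rather than raising, so `json.loads` would then fail on the empty input; real code would want to check for that.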

score -1 · Answer 2 · answered Feb 13 '19 at 14:17

You can match on the whole json part, and use a group to match the internal contents: json='(.+)'

A working example on regexr.

This would return [46801158,105847139,"N\/A"] in group #1.
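In Python, that pattern could be applied with `re.search`, roughly as in this sketch. The sample `page_source` is modelled on the question and the variable names are assumptions, not part of the original answer.

```python
import re

# Sample input modelled on the question; a raw string keeps the backslash as-is.
page_source = r"""blablabla<script>json='[46801158,105847139,"N\/A"]'</script>"""

# Group 1 captures everything between json=' and the last single quote on the line.
match = re.search(r"json='(.+)'", page_source)
json_text = match.group(1) if match else None

print(json_text)
```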

answered Feb 13 '19 at 14:17 by Jim Wright

We really shouldn't be using regex to parse HTML! – Nordle Feb 13 '19 at 14:19

score -2 · Answer 3 · answered Feb 13 '19 at 14:21

You could use regex:

>>> from re import findall
>>> findall(r"<script>json='(.+)'</script>", r"""<script>json='[46801158,105847139,"N\/A"]'</script>""")
['[46801158,105847139,"N\\/A"]']

This uses the regex <script>json='(.+)'</script>, which looks for the script tags and captures everything between the single quotes that follow json=.
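One caveat worth knowing: `(.+)` is greedy. The question says there is only one script tag, so that is fine here, but if a page ever contained two such blocks on one line the greedy group would span both. The sketch below, with a made-up `two_blocks` string, compares it with the non-greedy `(.+?)`.

```python
from re import findall

# Hypothetical input with two script blocks, unlike the page in the question.
two_blocks = "<script>json='[1,2]'</script><p>x</p><script>json='[3,4]'</script>"

# Greedy: .+ runs to the last '</script>, swallowing everything in between.
greedy = findall(r"<script>json='(.+)'</script>", two_blocks)

# Non-greedy: .+? stops at the first '</script> after each opening tag.
lazy = findall(r"<script>json='(.+?)'</script>", two_blocks)

print(greedy)
print(lazy)
```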

answered Feb 13 '19 at 14:21 by Adam Dadvar

Your answer doesn't seem to add value over Jim Wright's one below. – sophros Feb 13 '19 at 14:22
which is also a bad answer, as we **shouldn't be parsing HTML with regex** – Nordle Feb 13 '19 at 14:22
@sophros the main added difference is that his does not look for the script tags, as specified in the question itself – Adam Dadvar Feb 13 '19 at 14:22
