Gah. I've spent a fair amount of time trying to find how to do this both properly and even hackishly, and I am just stumped. I have 2500+ HTML files that I've downloaded from a site, and I need only to extract a limited amount of information from any given page: the title of the talk described by the page (so I can collate this data with a giant CSV we already have), and then the event at which a given talk was given, and the date on which the talk was published.

The HTML for these pages is sprawling and filled with <script> elements. I want only the one whose opening tag is immediately followed by a call to q. The line that starts this block looks like this:

<script>q("talkPage.init", {

What follows is quite a bit of data. I need only the three items that look like this:

"event":"TEDGlobal 2005",
"filmed":1120694400,
"published":1158019860,

Luckily, "filmed" and "published" only occur once in this large block, but "event" occurs several times. It's always the same, so I don't care which of these any script grabs.

My thought was to use BeautifulSoup to find the <script>q element and then pass that on to the json module to parse, but I cannot figure out how to tell soup to grab the <script> element followed by a q -- classes and ids are easy. Followed by ... not so much.

To begin working on the JSON portion, I've created a text file which has only the contents of the <script>q element in it, but I confess that getting the json module to load this is not working terribly well.

The code I have for the experiment first loads the text file with the JSON block I'm interested in, and then tries to decode it so I can do other things with it:

import json

text = open('dawkins_script_element.txt', 'r').read()
data = json.loads(text)

But clearly the JSON decoder doesn't like what I have, because it throws a ValueError: Expecting value: line 1 column 1 (char 0). Bah!

Here's what the first three lines of this script element look like:

<script>q("talkPage.init", {
"el": "[data-talk-page]",
"__INITIAL_DATA__":

And that is where I am at the current moment. Any light that can be shed on either the soup or the json to get this done would be much appreciated.

John Laudun

3 Answers


Without knowing the full context, here's a poor man's attempt:

Assuming your html looks something like this:

<script>foo</script>
<script>bar</script>
<script>q("talkPage.init",{
"foo1":"bar1",
"event":"TEDGlobal 2005",
"filmed":1120694400,
"published":1158019860,
"foo2":"bar2"
})</script>
<script>q("talkPage.init",{
"foo1":"bar1",
"event":"foobar",
"filmed":123,
"published":456,
"foo2":"bar2"
})</script>
<script>foo</script>
<script>bar</script>

You could write code like this:

import requests
import bs4
import json

res = requests.get(url)  # your link here
soup = bs4.BeautifulSoup(res.content, 'html.parser')
my_list = [i.string.lstrip('q("talkPage.init", ').rstrip(')')
           for i in soup.select('script')
           if i.string and i.string.startswith('q')]

# my_list should now be filled with all the json text that is from a <script> tag followed by a 'q'
# note that I lstrip and rstrip on the script based on your sample (assuming there's a closing bracket), but if the convention is different you'll need to update that accordingly.

#...#
my_jsons = []
for json_string in my_list:
    my_jsons.append(json.loads(json_string))

# parse your my_jsons however you want.

Then you can start interpreting the jsons:

print(my_jsons[0]['event'])
print(my_jsons[0]['filmed'])
print(my_jsons[0]['published'])

# Output:
# TEDGlobal 2005
# 1120694400
# 1158019860

There are a lot of assumptions here: that the text within your <script>q elements always starts with q("talkPage.init", and ends with a ), that the extracted text is valid JSON for your next stage of parsing, and that you know how to work with the parsed results.
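One caveat worth noting: str.lstrip and str.rstrip strip sets of characters rather than literal prefixes and suffixes, so the comprehension above only works because the payload happens to begin with { and end with } (neither of which appears in the strip arguments). A more defensive sketch, reusing the soup object from above and assuming the payload is the outermost {...} inside each matching tag and is valid JSON:

import json

def extract_payload(script_text):
    # Hypothetical helper: take everything between the first '{' and the
    # last '}' of the q("talkPage.init", ...) call and parse it as JSON.
    start = script_text.find('{')
    end = script_text.rfind('}')
    if start == -1 or end == -1:
        return None
    return json.loads(script_text[start:end + 1])

my_jsons = [extract_payload(i.string)
            for i in soup.select('script')
            if i.string and i.string.startswith('q')]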

r.ook
  • I'm working with both these answers to see which one works best. I'll have more tomorrow first thing. Thanks for such a thoughtful answer! – John Laudun Jan 19 '18 at 03:57
  • Working through this, I have to use `soup = BeautifulSoup(text, "html5lib")` because when I try, `soup = BeautifulSoup(text.content)`, I get `AttributeError: 'str' object has no attribute 'content'`. I then use the list comprehension, which produces a list one item long, but it has the section we want. Right now, I'm stumbling on the JSON parsing, so that's what I will spend my morning working on. Thanks! – John Laudun Jan 19 '18 at 14:15
  • It depends what your original content is - if it was a `request` object (in my sample) then you'll want to get the text `content` for BeautifulSoup to parse. In your case it seems yours is a `str` object so just passing the `str` itself is good enough. If you're having trouble parsing the json I recommend you get a bit more info on the [`json` module](https://docs.python.org/3/library/json.html). – r.ook Jan 19 '18 at 17:10
  • I've definitely learned how fussy the JSON module is. Some may find this pretty ugly, but to get the JSON module to parse the string I had to do this: `pre_json = '{"' + "".join(my_list)` and then `my_json = json.loads(pre_json)` – John Laudun Jan 19 '18 at 17:24
  • @JohnLaudun that sounds more like the string being passed to the `json` module isn't properly formatted. You'll want to ensure the string result you parsed from the html can be read as a dictionary. If it's missing the `{` then it might have been stripped from the html parsing by mistake. – r.ook Jan 19 '18 at 17:34
  • I confess I'm coming to hate the `json` module: it wasn't parsing a bracketed series `[{"spam":"eggs" ... }]`, so I got rid of the brackets using `translation_table = dict.fromkeys(map(ord, '[]'), None)`, but now it appears to be choking on a perfectly good property. I'm going to restart the kernel for this notebook and see what happens... – John Laudun Jan 19 '18 at 18:05
  • So the problem is an embedded list inside the JSON on the page: `[{'more_resources': None, 'take_action': None, ... 'event': 'TEDGlobal 2005'}]`. The `json` module wants double quotes. – John Laudun Jan 19 '18 at 18:23
  • The `json` module is actually pretty easy to use IMO. Parsing nested jsons might be a pain but it is what it is. You can easily fix the single quotes by doing a `.replace("'",'"')` (first param is single quote in double quotes, second is double quote in single quotes) on the text before you pass onto `json`. For parsing, you might want to look into relevant threads like [this one](https://stackoverflow.com/questions/21028979/recursive-iteration-through-nested-json-for-specific-key-in-python). – r.ook Jan 19 '18 at 18:50
  • I agree that it's probably easy when you have proper JSON. Sadly, I don't. I thought about replacing the single quotes with double quotes, but I have double quotes within some of the values, so I was pretty sure it would break. What I've done with, for now, is turning the value that is in fact a one-item list into a string, splitting on commas, and then filtering on some regex for the properties I want. It ain't pretty, but if it will work for the 2500 texts I have, I'll like it well enough. – John Laudun Jan 19 '18 at 19:36

You can use a regular expression to match the part you want.

import re

# Filter from the <script> tag all the way to the closing ')' of q(...).
# Here t is the raw HTML text of the page.
script_tag = re.findall(r'<script>q\((?s:.+)\)', t)
json_content = re.search(r'(?<=q\()(?s:.+)\)', script_tag[0]).group()
json_content = json_content[:-1]  # Strip last ')'

To find the values you need, you can either use Python's json library to parse it or match them directly with a regex, since filmed and published are unique and event doesn't differ (as I understood?):

import json
json_content = json.loads(json_content)
json_content['event']  # or whatever

OR

def get_val(a, text):
    # Look behind for '<key>": ' and capture the rest of that line
    return re.search(r'(?<=' + a + r'": )(.+)', text).group(0)

The result from the latter needs to be filtered a bit to remove any trailing ]" and preceding "[, or whatever else you don't want from it.
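For example, a quick usage sketch (assuming json_content still holds the raw text from the regex above rather than the parsed dict, and that the keys appear in the "key": value form the lookbehind expects):

raw = get_val('published', json_content)  # e.g. '1158019860,'
published = int(raw.rstrip(','))          # trim the trailing comma before converting

raw = get_val('event', json_content)      # e.g. '"TEDGlobal 2005",'
event = raw.rstrip(',').strip('"')        # trim the trailing comma and surrounding quotes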

I've heard BeautifulSoup is also a good library for matching HTML, but I'm not so familiar with it.

Three
  • I like the idea of doing this with `regex`, which has always exceeded my grasp, but the line `script_tag = re.findall(r' – John Laudun Jan 19 '18 at 14:09

Here's the script I ended up using, with real thanks to both @Idlehands and @Three. To reach into the weird single-quoted JSON, I took the entire JSON element and read it into a list, split on commas. It's a hack, but it mostly works.

def get_metadata(the_file):

    # Load the modules we need
    from bs4 import BeautifulSoup
    import json
    import re
    from datetime import datetime

    # Read the file, load it into BS, then grab section we want
    text = the_file.read()
    soup = BeautifulSoup(text, "html5lib")
    my_list = [i.string.lstrip('q("talkPage.init", {\n\t"el": "[data-talk-page]",\n\t "__INITIAL_DATA__":')
               .rstrip('})')
               for i in soup.select('script') 
               if i.string and i.string.startswith('q')]

    # Read first layer of JSON and get out those elements we want
    pre_json = '{"' + "".join(my_list)
    my_json = json.loads(pre_json)
    slug = my_json['slug']
    vcount = my_json['viewed_count']
    event = my_json['event']

    # Read second layer of JSON and get out listed elements:
    properties = "filmed,published" # No spaces between terms!
    talks_listed = str(my_json['talks']).split(",")
    regex_list = [".*("+i+").*" for i in properties.split(",")]
    matches = []
    for e in regex_list:
        filtered = filter(re.compile(e).match, talks_listed)
        indexed = "".join(filtered).split(":")[1]
        matches.append(indexed)
    filmed = datetime.utcfromtimestamp(float(matches[0])).strftime('%Y-%m-%d')
    # published = datetime.utcfromtimestamp(float(matches[1])).strftime('%Y-%m-%d')
    return slug, vcount, event, filmed, #published
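A minimal usage sketch for running this over a folder of downloaded pages (the talks/ directory and the CSV layout here are just placeholders):

import csv
import glob

rows = []
for path in glob.glob('talks/*.html'):   # hypothetical location of the saved pages
    with open(path, 'r') as the_file:
        rows.append(get_metadata(the_file))

with open('talk_metadata.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['slug', 'viewed_count', 'event', 'filmed'])
    writer.writerows(rows)
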
John Laudun