Gah. I've spent a fair amount of time trying to find how to do this, either properly or even hackishly, and I am just stumped. I have 2500+ HTML files that I've downloaded from a site, and I need to extract only a limited amount of information from any given page: the title of the talk described by the page (so I can collate this data with a giant CSV we already have), the event at which the talk was given, and the date on which the talk was published.
The HTML for these pages is sprawling and filled with <script> elements. I want only the one that is followed by a q. The line that starts this block looks like this:
<script>q("talkPage.init", {
What follows is quite a bit of data. I need only the three items that look like this:
"event":"TEDGlobal 2005",
"filmed":1120694400,
"published":1158019860,
Luckily, "filmed" and "published" only occur once in this large block, but "event" occurs several times. It's always the same, so I don't care which of these any script grabs.
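Since the three items have such a regular shape, one hackish route would be to skip JSON parsing and pull them out with regular expressions. The following is only a minimal sketch under assumptions: `extract_fields` is a hypothetical helper name, the `sample` string just repeats the three lines shown above, and the two numbers are assumed to be Unix timestamps in seconds (UTC).

```python
import re
from datetime import datetime, timezone

# Stand-in for the large block: only the three lines quoted in the question.
sample = '"event":"TEDGlobal 2005",\n"filmed":1120694400,\n"published":1158019860,'

def extract_fields(block):
    """Return event, filmed and published from the raw text (None if a key is missing)."""
    # re.search returns the first match, which is fine because every
    # "event" entry is said to hold the same value.
    event = re.search(r'"event"\s*:\s*"([^"]*)"', block)
    filmed = re.search(r'"filmed"\s*:\s*(\d+)', block)
    published = re.search(r'"published"\s*:\s*(\d+)', block)

    def to_date(match):
        # Assumption: the value is a Unix timestamp in seconds, read as UTC.
        if match is None:
            return None
        return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc).date()

    return {
        "event": event.group(1) if event else None,
        "filmed": to_date(filmed),
        "published": to_date(published),
    }

fields = extract_fields(sample)
print(fields)
```

This would break if an event name ever contained an escaped double quote, so it is a fallback rather than a substitute for real JSON parsing.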
My thought was to use BeautifulSoup to find the <script>q element and then pass that on to the json module to parse, but I cannot figure out how to tell soup to grab the <script> element followed by a q. Classes and ids are easy. Followed by ... not so much.
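One possible reading of "followed by a q" is "the <script> whose text content starts with q("talkPage.init"". Here is a rough sketch of that idea using only the standard library's html.parser in place of BeautifulSoup; `ScriptCollector`, `find_talk_script`, and the tiny `page` string are made up for illustration. The same startswith check should carry over to BeautifulSoup by looping over `soup.find_all("script")` and testing each tag's text.

```python
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element in a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._buffer = None  # None while we are outside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.scripts.append("".join(self._buffer))
            self._buffer = None

def find_talk_script(html, prefix='q("talkPage.init"'):
    """Return the body of the first <script> that starts with `prefix`, else None."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    for body in parser.scripts:
        if body.lstrip().startswith(prefix):
            return body
    return None

# Invented miniature page with one unrelated script and one q(...) script.
page = '''<html><head><script>var x = 1;</script></head>
<body><script>q("talkPage.init", {"el": "[data-talk-page]"})</script></body></html>'''

script = find_talk_script(page)
print(script)
```

Matching on the start of the text rather than on an attribute seems necessary here, because the element has no class or id to select on.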
To begin working on the JSON portion, I've created a text file which has only the contents of the <script>q element in it, but I confess that getting the json module to load this is not working terribly well.
The code I have for the experiment first loads the text file with the JSON block I'm interested in, and then tries to decode it so I can do other things with it:
import json
# Read the saved <script> element and try to decode it as JSON
with open('dawkins_script_element.txt', 'r') as f:
    text = f.read()
data = json.loads(text)
But clearly the JSON decoder doesn't like what I have, because it throws a ValueError: Expecting value: line 1 column 1 (char 0). Bah!
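The error position (char 0) is consistent with the saved text beginning with <script>q("talkPage.init", which is a JavaScript call wrapped in a tag rather than JSON. A possible workaround, sketched here under assumptions, is to start decoding at the first { with `json.JSONDecoder().raw_decode`, which stops at the end of the object and ignores the trailing `)` and closing tag. `parse_q_call` and `find_key` are hypothetical helpers, and the `sample` below is an invented, shortened stand-in: the real contents of "__INITIAL_DATA__" are not shown in this post and may be shaped differently.

```python
import json

def parse_q_call(script_text):
    """Decode the first JSON object embedded in a q("talkPage.init", {...}) call."""
    start = script_text.find("{")
    if start == -1:
        raise ValueError("no '{' found in script text")
    # raw_decode parses one JSON value starting at `start` and returns it
    # together with the index where that value ends; trailing text is ignored.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj

def find_key(node, key):
    """Depth-first search for the first value stored under `key`."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None

# Invented stand-in: the nesting under "__INITIAL_DATA__" is an assumption.
sample = '''<script>q("talkPage.init", {
"el": "[data-talk-page]",
"__INITIAL_DATA__": {"event":"TEDGlobal 2005","filmed":1120694400,"published":1158019860}
})</script>'''

data = parse_q_call(sample)
event = find_key(data, "event")
print(event, find_key(data, "filmed"), find_key(data, "published"))
```

The recursive lookup is there because "event" occurs several times at unknown depths; taking the first hit matches the "I don't care which" requirement.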
Here's what the first three lines of this script element look like:
<script>q("talkPage.init", {
"el": "[data-talk-page]",
"__INITIAL_DATA__":
And that is where I am at the current moment. Any light that can be shed on either the soup or the json to get this done would be much appreciated.