I'm using Selenium (ChromeDriver) with Python 2.7 to scrape a website for some dynamic text that shows up inside certain tags. The tag contents are HTML nested in a JSON object, which the page uses to build a list of content, but I'm only interested in the textual content. I was able to figure out how to clean out the HTML tags using re, but the text still contains HTML character codes for special characters, which I want to replace with the characters they correspond to.
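Roughly, the tag cleanup looks like this (simplified; strip_tags is just an illustrative name, and the pattern is naive but sufficient for this data):

import re

TAG_RE = re.compile(r'<[^>]+>')

def strip_tags(markup):
    # Crude removal of anything that looks like an HTML tag
    return TAG_RE.sub('', markup)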
So, for example, say my JSON (after cleaning out the HTML tags) is as follows:
[
    {
        "data": {
            "published_on": "2019-01-15T08:46:00+00:00",
            "id": "somealphanumericid",
            "short_description": "Albert Einstein&#8217;s Theory of Relativity&#58; Should We Worry&#8230;?",
            "series": "Science",
            "long_description": "Albert Einstein does an interview with Mr&#46; John Smith about the application of the theory of relativity&#44; and what it could mean for the future of the pizza industry!",
            "duration": "752000",
            "type": "video",
            "title": "Albert Einstein&#8217;s Theory of Relativity&#58;"
        },
        "links": {
            "permalink": "https://www.stackoverflow.com"
        },
        "key": "somealphanumericid"
    },
    ...
]
Edit: the JSON object is actually an array of JSON objects, hence the []. The site I'm scraping is paginated, so I obtain the JSON from each page and, at the end, concatenate everything into one array so it's easier to work with.
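Roughly, the merging looks like this (page_jsons is just a stand-in name for the raw JSON string scraped from each page):

import json

all_items = []
for page_json in page_jsons:
    # Each page yields one JSON array; concatenate them all
    all_items.extend(json.loads(page_json))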
In the example above, you can see that characters such as periods, commas, colons, apostrophes, etc. are scraped as their corresponding HTML character codes.
Now, I'm iterating over the JSON and putting everything into an SQLite database, so it doesn't matter whether I replace the character codes in the JSON itself or do the replacement right before pushing the data into the db.
The first thing I tried was a secondary function that takes a string as its argument and returns the string with the character codes replaced. I basically modified the solution that can be found here. So this function was:
from BeautifulSoup import BeautifulStoneSoup  # BeautifulSoup 3 (Python 2)

def HTMLEntitiesToUnicode(text):
    # Convert all named and numeric HTML/XML entities to unicode characters
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text
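As a sanity check, calling it directly on a sample string behaves the way I want:

HTMLEntitiesToUnicode(u'Albert Einstein&#8217;s Theory of Relativity&#58;')
# -> u'Albert Einstein\u2019s Theory of Relativity:'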
I used this in a loop that builds each row of data to be pushed to SQLite, as such:
import json

def json_to_rows(json_file):
    with open(json_file, 'r') as infile:
        data = json.load(infile)

    # Build one tuple per item, decoding entities in the free-text fields
    data_as_rows = []
    for i in range(len(data)):
        data_as_rows.append((
            data[i]['key'],
            data[i]['data']['id'],
            data[i]['links']['permalink'],
            HTMLEntitiesToUnicode(data[i]['data']['series']),
            HTMLEntitiesToUnicode(data[i]['data']['title']),
            data[i]['data']['published_on'],
            data[i]['data']['type'],
            data[i]['data']['duration'],
            HTMLEntitiesToUnicode(data[i]['data']['short_description']),
            HTMLEntitiesToUnicode(data[i]['data']['long_description']),
        ))
    return data_as_rows
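For completeness, the rows are then meant to be inserted in one go, along the lines of the sketch below (rows_to_db and the content table name are placeholders, not my exact schema):

import sqlite3

def rows_to_db(db_file, rows):
    conn = sqlite3.connect(db_file)
    with conn:  # commits automatically on success
        conn.executemany(
            'INSERT INTO content VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
            rows)
    conn.close()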
However, this resulted in the following error when parsing HTMLEntitiesToUnicode(data[i]['data']['series']):
File "BeautifulSoup.py", line 1918, in _detectEncoding
'^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
TypeError: expected string or buffer
I can't figure out why BeautifulSoup isn't seeing this as a string, but I attempted to modify it to:
HTMLEntitiesToUnicode(str(data[i]['data']['series']))
Which then gave me the error:
File "support.py", line 162, in json_to_rows
HTMLEntitiesToUnicode(str(data[i]['data']['series'])),
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 129: ordinal not in range(128)
Adding .encode('utf-8') did not resolve the error either (this was recommended in various other posts about the same error message).
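To be clear, I tried variations along these lines (among others):

HTMLEntitiesToUnicode(data[i]['data']['series'].encode('utf-8'))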
My end goal is just to scrape all this info into a db such that it's formatted as normal, legible text (other than duration, which is of type INTEGER anyway).
I'd like to replace the characters before/as the data is fed into the DB, but it's also possible to iterate through the DB in a separate pass and clean it up afterwards, though that seems like a much less efficient way of doing it.