
I'm using Selenium (ChromeDriver) with Python 2.7 to scrape a website for some dynamic text that shows up in tags. Inside is HTML nested in a JSON object, which is used to build a list of content on the page I'm viewing, but I'm only interested in the textual content. I was able to figure out how to strip the HTML tags using re, but the text still contains HTML character codes (numeric character references) for special characters, which I want to replace with the characters they correspond to.

So, for example, say my JSON (after cleaning out the HTML tags) looks like this:

[
    {
        "data": {
            "published_on": "2019-01-15T08:46:00+00:00", 
            "id": "somealphanumericid", 
            "short_description": "Albert Einstein’s Theory of Relativity: Should We Worry…?", 
            "series": "Science", 
            "long_description": "Albert Einstein does an interview with Mr. John Smith about the application of the theory of relativity, and what it could mean for the future of the pizza industry!", 
            "duration": "752000", 
            "type": "video", 
            "title": "Albert Einstein’s Theory of Relativity:"
        }, 
        "links": {
            "permalink": "https://www.stackoverflow.com"
        }, 
        "key": "somealphanumericid"
    },
    ...
]

Edit: The JSON object is actually an array of JSON objects, hence the []. The site I'm scraping is paginated, so I obtain the JSON from each page and at the end just concatenate them into one array so it's easier to work with.

You can see that special characters such as apostrophes, colons, and ellipses are scraped as their corresponding HTML character codes.

Now, I'm iterating over the JSON and putting everything into an SQLite database, so it doesn't matter whether I replace the character codes in the JSON itself or do the replacement right before pushing the data into the DB.
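As an aside, the standard library can do this unescaping on its own: Python 3 has `html.unescape`, and Python 2.6+ has the (undocumented) `HTMLParser.HTMLParser().unescape()`. A minimal Python 3 sketch, using the `short_description` string from the sample above:

```python
import html

# Numeric character references (&#8217; etc.) become real characters.
s = "Albert Einstein&#8217;s Theory of Relativity&#58; Should We Worry&#8230;&#63;"
print(html.unescape(s))
# Albert Einstein’s Theory of Relativity: Should We Worry…?
```

This avoids pulling in BeautifulSoup just for entity decoding.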

The first thing I tried was a helper function that takes a string as its argument and returns the string with the character codes replaced. I basically modified the solution that can be found here. The function was:

from BeautifulSoup import BeautifulStoneSoup

def HTMLEntitiesToUnicode(text):
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

I used this in a loop that builds the rows of data to be pushed to SQLite, as follows:

def json_to_rows(json_file):
    with open(json_file, 'r') as infile:
        data = json.load(infile)
        data_as_rows = []
        for i in range(len(data)):
            data_as_rows.append((
                data[i]['key'],
                data[i]['data']['id'],
                data[i]['links']['permalink'],
                HTMLEntitiesToUnicode(data[i]['data']['series']),
                HTMLEntitiesToUnicode(data[i]['data']['title']),
                data[i]['data']['published_on'],
                data[i]['data']['type'],
                data[i]['data']['duration'],
                HTMLEntitiesToUnicode(data[i]['data']['short_description']),
                HTMLEntitiesToUnicode(data[i]['data']['long_description']),
                ))

    return data_as_rows

However, this resulted in the following error when parsing HTMLEntitiesToUnicode(data[i]['data']['series']):

File "BeautifulSoup.py", line 1918, in _detectEncoding
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
TypeError: expected string or buffer

I can't figure out why BeautifulSoup isn't treating this as a string, so I tried modifying the call to:

HTMLEntitiesToUnicode(str(data[i]['data']['series']))

Which then gave me the error:

File "support.py", line 162, in json_to_rows
    HTMLEntitiesToUnicode(str(data[i]['data']['series'])),
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 129: ordinal not in range(128)

Adding .encode('utf-8') did not resolve the error either (this was recommended in various other posts with the same error message).

My end goal is just to scrape all this info into a DB such that it's formatted as normal legible text (other than duration, which is of type INTEGER anyway).

I'd like to do the replacing of the characters before/as the data is fed into the DB, but I could also iterate through the DB in a separate pass and clean it up afterwards, though that seems like a much less efficient way of doing it.
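For reference, since the rows go straight into SQLite, `executemany` with `?` placeholders is a natural fit for the insert step. This is only a sketch: the table name `videos`, its column layout, and the sample row are my assumptions, not from the original code.

```python
import sqlite3

# One sample row in the same order as json_to_rows() builds them
# (key, id, permalink, series, title, published_on, type, duration,
#  short_description, long_description) -- layout assumed for illustration.
sample_rows = [
    ("somealphanumericid", "somealphanumericid", "https://www.stackoverflow.com",
     "Science", "Albert Einstein's Theory of Relativity:",
     "2019-01-15T08:46:00+00:00", "video", "752000",
     "short description", "long description"),
]

conn = sqlite3.connect(":memory:")  # use a file path for a persistent DB
conn.execute("""CREATE TABLE videos (
    key TEXT, id TEXT, permalink TEXT, series TEXT, title TEXT,
    published_on TEXT, type TEXT, duration INTEGER,
    short_description TEXT, long_description TEXT)""")
# Parameter placeholders also handle quoting/escaping safely.
conn.executemany("INSERT INTO videos VALUES (?,?,?,?,?,?,?,?,?,?)", sample_rows)
conn.commit()
print(conn.execute("SELECT title FROM videos").fetchone()[0])
```

Note that SQLite's type affinity coerces the `"752000"` string into the INTEGER `duration` column on insert.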

Vash

2 Answers


I think the problem you have above is that your text is already unicode, and you are trying to cast it to unicode a second time, which causes the error.

The code below works for me and gives the output shown:

from bs4 import BeautifulSoup

text = "Albert Einstein&#8217;s Theory of Relativity&#58; Should We Worry&#8230;&#63;"
parsed_html = BeautifulSoup(text, 'html.parser')

print 'Original Type:', type(text)
print 'Original Text: ' + text
print 'Parsed Type:  ', type(parsed_html.text)
print 'Parsed Text:   ' + parsed_html.text

Output:

Original Type: <type 'str'>
Original Text: Albert Einstein&#8217;s Theory of Relativity&#58; Should We Worry&#8230;&#63;
Parsed Type: <type 'unicode'>
Parsed Text: Albert Einstein’s Theory of Relativity: Should We Worry…?

Using BeautifulSoup4 version 4.7.1

pip install beautifulsoup4

cullzie
    This does work as a solution, though I went with the original solution I had tried that led me to the error message. The problem was that I was reading from a non-UTF-8 .json file, so I made sure those files were saved as UTF-8 first. – Vash Jan 16 '19 at 08:25

It turns out that the reason HTMLEntitiesToUnicode() wasn't working for me was that I was reading data from a .json file that had been written without specifying UTF-8 encoding. After fixing that, using HTMLEntitiesToUnicode() as described above worked fine.
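To illustrate the fix, writing the JSON out with an explicit UTF-8 encoding avoids the mismatch entirely. A Python 3 sketch with a made-up file path (the rough Python 2 equivalent is `io.open(path, 'w', encoding='utf-8')` together with `json.dump(..., ensure_ascii=False)`):

```python
import json
import os
import tempfile

# Sample data containing a non-ASCII character (a curly apostrophe).
data = [{"data": {"title": u"Albert Einstein\u2019s Theory of Relativity:"}}]

# Hypothetical file path, for illustration only.
path = os.path.join(tempfile.gettempdir(), "scraped.json")

# Write with an explicit encoding; ensure_ascii=False keeps real characters
# in the file instead of \uXXXX escapes.
with open(path, "w", encoding="utf-8") as outfile:
    json.dump(data, outfile, ensure_ascii=False)

# Read it back with the same explicit encoding.
with open(path, "r", encoding="utf-8") as infile:
    loaded = json.load(infile)

print(loaded[0]["data"]["title"])
```

As long as both the write and the read declare the encoding, the round trip is lossless.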

Vash