I've written a script that is supposed to retrieve HTML pages from a site and update their contents. The following function looks for a certain file on my system, then attempts to open and edit it:

def update_sn(files_to_update, sn, table, title):
    paths = files_to_update['files']
    print('updating the sn')
    try:
        sn_htm = [s for s in paths if re.search('^((?!(Default|Notes|Latest_Addings)).)*htm$', s)][0]
        notes_htm = [s for s in paths if re.search('_Notes\.htm$', s)][0]

    except Exception:
        print('no sns were found')
        pass

    new_path_name = new_path(sn_htm, files_to_update['predecessor'], files_to_update['original'])
    new_sn_number = sn

    htm_text = open(sn_htm, 'rb').read().decode('cp1252')
    content = re.findall(r'(<table>.*?<\/table>.*)(?:<\/html>)', htm_text, re.I | re.S) 
    minus_content = htm_text.replace(content[0], '')
    table_soup = BeautifulSoup(table, 'html.parser')
    new_soup = BeautifulSoup(minus_content, 'html.parser')
    head_title = new_soup.title.string.replace_with(new_sn_number)
    new_soup.link.insert_after(table_soup.div.next)

    with open(new_path_name, "w+") as file:
        result = str(new_soup)
        try:
            file.write(result)
        except Exception:
            print('Met exception.  Changing encoding to cp1252')
            try:
                file.write(result('cp1252'))
            except Exception:
                print('cp1252 did\'nt work.  Changing encoding to utf-8')
                file.write(result.encode('utf8'))
                try:
                    print('utf8 did\'nt work.  Changing encoding to utf-16')
                    file.write(result.encode('utf16'))
                except Exception:
                    pass

This works in the majority of cases, but sometimes it fails to write, at which point the exception kicks in and I try every feasible encoding without success:

updating the sn
Met exception.  Changing encoding to cp1252
cp1252 did'nt work.  Changing encoding to utf-8
Traceback (most recent call last):
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 145, in update_sn
    file.write(result)
  File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4006-4007: character maps to <undefined>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
    file.write(result('cp1252'))
TypeError: 'str' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scraper.py", line 79, in <module>
    get_latest(entries[0], int(num), entries[1])
  File "scraper.py", line 56, in get_latest
    update_files.update_sn(files_to_update, data['number'], data['table'], data['title'])
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 152, in update_sn
    file.write(result.encode('utf8'))
TypeError: write() argument must be str, not bytes

Can anyone give me any pointers on how to better handle HTML data that might have inconsistent encoding?

David J.

2 Answers

Just out of curiosity, is this line of code a typo: file.write(result('cp1252'))? It seems to be missing the .encode method.

Traceback (most recent call last):
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
    file.write(result('cp1252'))
TypeError: 'str' object is not callable

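Does it work if you change the code to file.write(result.encode('cp1252'))? Note that encode returns bytes, so the file would also need to be opened in binary mode ('wb') for that write to succeed.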

I once had a similar encoding problem when writing to a file and worked out my own solution with the help of the following thread:

Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence.

My problem was solved by switching the parser from html.parser to html5lib. The root cause was a malformed HTML tag, which the html5lib parser handled. For your reference, see the BeautifulSoup documentation on each of the parsers it supports.

Hope this helps

Toto Lele

In your code you open the file in text mode, but then you attempt to write bytes (str.encode returns bytes) and so Python throws an exception:

TypeError: write() argument must be str, not bytes

If you want to write bytes, you should open the file in binary mode.
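As a minimal sketch of that difference (the helper names and the temporary file path here are made up for illustration, and Python 3 is assumed), the first function encodes the text and writes it in binary mode, while the second shows that a text-mode file rejects bytes:

```python
import os
import tempfile

text = "Somé téxt"

def write_as_bytes(path, text, encoding="utf-8"):
    """Encode the text ourselves and write the bytes in binary mode."""
    with open(path, "wb") as f:
        f.write(text.encode(encoding))

def text_mode_rejects_bytes(path, text):
    """Return True if writing bytes to a text-mode file raises TypeError."""
    try:
        with open(path, "w") as f:
            f.write(text.encode("utf-8"))  # bytes into a text-mode file
    except TypeError:
        return True
    return False

# Hypothetical scratch location, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "demo.htm")

write_as_bytes(path, text)
with open(path, "rb") as f:
    written = f.read()

rejected = text_mode_rejects_bytes(path, text)
```

Here `written` holds the UTF-8 bytes of the text, and `rejected` records that the text-mode write failed in the same way as the last traceback in the question.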

BeautifulSoup detects the document's encoding (if the input is bytes) and converts it to a string automatically. We can read the detected encoding from .original_encoding and use it to encode the content when writing to the file. For example:

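from bs4 import BeautifulSoup

# Pass bytes so BeautifulSoup can detect the encoding itself
soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
data = soup.tag.text
# original_encoding is None when the input was already a str
encoding = soup.original_encoding or 'utf-8'
print(encoding)
#ascii

# Binary mode, because str.encode returns bytes
with open('my.file', 'wb+') as file:
    file.write(data.encode(encoding))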

For this to work, you should pass your HTML to BeautifulSoup as bytes, so don't decode the response content first.

If BeautifulSoup fails to detect the correct encoding for some reason, you could try a list of possible encodings, as you have done in your code:

data = 'Somé téxt'
encodings = ['ascii', 'utf-8', 'cp1252']

with open('my.file', 'wb+') as file:
    for encoding in encodings:
        try:
            file.write(data.encode(encoding))
            break
        except UnicodeEncodeError:
            print(encoding + ' failed.')

Alternatively, you could open the file in text mode and pass the encoding to open (instead of encoding the content yourself). Note that the built-in open in Python 2 does not accept an encoding argument; io.open does.
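A rough sketch of the text-mode approach, assuming Python 3 (write_html, read_bytes and the file names are hypothetical helpers for illustration, not part of the question's script). The traceback in the question goes through cp1252.py, which suggests the file was opened with the Windows default encoding; naming the encoding explicitly avoids that. The errors="xmlcharrefreplace" handler is an optional extra that may suit HTML output, since it replaces characters the chosen codec cannot represent with numeric character references instead of raising:

```python
import os
import tempfile

def write_html(path, text, encoding="utf-8"):
    """Write text in text mode, letting open() do the encoding."""
    # errors="xmlcharrefreplace" replaces characters the codec cannot
    # represent with numeric character references such as &#8594;
    with open(path, "w", encoding=encoding, errors="xmlcharrefreplace") as f:
        f.write(text)

def read_bytes(path):
    """Read the raw bytes back so the result can be inspected."""
    with open(path, "rb") as f:
        return f.read()

# Hypothetical scratch files, used only for this demonstration
out_dir = tempfile.mkdtemp()
utf8_path = os.path.join(out_dir, "utf8.htm")
cp1252_path = os.path.join(out_dir, "cp1252.htm")

sample = "é \u2192 x"  # U+2192 (right arrow) does not exist in cp1252

write_html(utf8_path, sample)                     # UTF-8 can encode everything
write_html(cp1252_path, sample, encoding="cp1252")  # arrow becomes &#8594;
```

With UTF-8 every character is written directly. With cp1252 the arrow is written as a character reference, which a browser renders the same way.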

t.m.adam