2

I'm using Python 3.3 and a SQLite3 database. I have a big text file, around 270 MB, which I can open with WordPad on Windows 7.

Each line in that file looks as follows:

term \t number\n

I want to read every line and save the values in the database. My code looks as follows:

f = open('sorted.de.word.unigrams', "r")
for line in f:

    #code

I was able to read data into my database, but only up to a certain line, I'd guess maybe half of all lines. Then I get the following error:

File "C:\projects\databtest.py", line 18, in <module>
for line in f:
File "c:\python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 140: character maps to   <undefined>

I tried to open the file with encoding='utf-8', but that didn't work, and neither did other codecs. Then I tried to make a copy with WordPad via Save As as a UTF-8 text file, but WordPad crashed.

Where is the problem here? It looks like there is some character in that line that Python can't handle. What can I do to read my file completely? Or is it maybe possible to ignore such error messages and just go on with the next line?

You can download the packed file here:

http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=frequency_lists:sorted.de.word.unigrams.7z

Thanks a lot!

zwieback86
  • 387
  • 3
  • 7
  • 14
  • You could try to import the file directly into sqlite3 without Python. – elyase Sep 06 '13 at 00:50
  • I'm trying that right now with SQLite Database Browser; it's taking ages... But for me it's not a good solution, because before writing the values to the database I have to add some numbers to the number, which I have to decide with Python. So I'm still looking for another solution via Python. – zwieback86 Sep 06 '13 at 01:09
  • see http://stackoverflow.com/a/9233174/14420, you need to open with `encoding=...` – matt wilkie Feb 16 '14 at 22:34

3 Answers

6

I checked the file, and the root of the problem seems to be that the file contains words in at least two encodings: probably cp1252 and cp850. The byte 0x81 is ü in cp850 but undefined in cp1252. You can handle that situation by catching the exception, but some other German characters map to valid yet wrong characters in cp1252. If you are happy with such an imperfect solution, here's how you could do it:

with open('sorted.de.word.unigrams', 'rb') as f:  # open in binary mode
    for line in f:
        for cp in ('cp1252', 'cp850'):
            try:
                s = line.decode(cp)
            except UnicodeDecodeError:
                pass
            else:
                store_to_db(s)
                break
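
store_to_db above is just a placeholder. A minimal sketch of what it could look like with sqlite3, assuming a hypothetical database file unigrams.db and a table unigrams(term, count) (names chosen only for illustration):

import sqlite3

conn = sqlite3.connect('unigrams.db')  # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS unigrams (term TEXT, count INTEGER)')

def store_to_db(s):
    # each decoded line has the form "term\tnumber\n"
    term, number = s.split('\t')
    conn.execute('INSERT INTO unigrams VALUES (?, ?)',
                 (term.strip(), int(number)))

Call conn.commit() once after the loop; committing after every single insert would make the import of a 270 MB file very slow.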
Janne Karila
  • 24,266
  • 6
  • 53
  • 94
2

This usually happens when there is an encoding mismatch.

The byte 0x81 does not map to any character in the default encoding, so try specifying the encoding explicitly:

file = open(filename, encoding="utf8")
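
If UTF-8 does not fit either (the asker reports it does not), the errors parameter of open() lets you keep reading past undecodable bytes. A minimal sketch, assuming cp1252 (the Windows default seen in the traceback):

# errors="replace" turns undecodable bytes into U+FFFD instead of raising
# UnicodeDecodeError; errors="ignore" would drop them silently.
with open('sorted.de.word.unigrams', 'r', encoding='cp1252', errors='replace') as f:
    for line in f:
        pass  # process the line / insert it into the database here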
atg
  • 204
  • 3
  • 18
1

Try

import codecs

data = []
with codecs.open('sorted.de.word.unigrams', 'r') as f:
    for line in f:
        data.append(line)

If you want to ignore the error, you can do:

try:
    # Your code that enters data into the database
except UnicodeDecodeError:
    pass
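
Note that a try/except wrapped around the whole loop stops at the first bad line rather than skipping it, because the decoding error is raised while iterating over a text-mode file. One way to really skip bad lines and continue with the next one is to read in binary mode and decode each line yourself; a minimal sketch, assuming cp1252 as the target encoding:

# Read raw bytes and decode each line individually so that one bad line
# does not abort the whole loop.
with open('sorted.de.word.unigrams', 'rb') as f:
    for raw in f:
        try:
            line = raw.decode('cp1252')
        except UnicodeDecodeError:
            continue  # skip undecodable lines and go on with the next one
        # insert the decoded line into the database here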
Vladimir Chub
  • 461
  • 6
  • 19