I'm using Python 3.3 and an sqlite3 database. I have a large text file (around 270 MB), which I can open with WordPad on Windows 7.
Each line in that file looks as follows:
term \t number\n
I want to read every line and save the values in a database. My code looks as follows:
f = open('sorted.de.word.unigrams', "r")
for line in f:
    # code
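For context, a minimal sketch of what the loop body could look like, assuming the `term \t number` line format described above. The table name `unigrams`, its columns, and the helper names `parse_line` and `load_lines` are hypothetical and only for illustration; the real code may differ.

```python
import sqlite3


def parse_line(line):
    """Split one 'term<TAB>number' line into (term, number)."""
    # rsplit on the last tab so a term that itself contains a tab still parses.
    term, number = line.rstrip("\n").rsplit("\t", 1)
    return term.strip(), int(number)


def load_lines(lines, conn):
    """Insert every parsed line into a (hypothetical) unigrams table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS unigrams (term TEXT, freq INTEGER)"
    )
    # executemany with a generator avoids building a 270 MB list in memory.
    conn.executemany(
        "INSERT INTO unigrams (term, freq) VALUES (?, ?)",
        (parse_line(line) for line in lines),
    )
    conn.commit()
```

Any iterable of lines works here, so the open file object `f` could be passed directly as `lines`.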
I was able to read the data into my database, but only up to a certain line, I would guess about half of all lines. Then I get the following error:
  File "C:\projects\databtest.py", line 18, in <module>
    for line in f:
  File "c:\python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 140: character maps to <undefined>
I tried opening the file with encoding="utf-8", but that did not work, and neither did the other codecs I tried. Then I tried to make a copy with WordPad by saving it as a UTF-8 text file, but WordPad crashed.
What is the problem here? It looks like there is some character in that line that Python can't handle. What can I do to read my file completely? Or is it possible to ignore such errors and just continue with the next line?
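To make the "ignore and continue" idea concrete, here is a minimal sketch that relies on the `encoding` and `errors` parameters of the built-in `open()`. The helper name `read_lines` is made up for illustration, and the default of `cp1252` is only an assumption taken from the traceback; the file's real encoding is not known. Note that `errors="replace"` does not skip the whole line, it substitutes U+FFFD for each undecodable byte, while `errors="ignore"` silently drops those bytes.

```python
def read_lines(path, encoding="cp1252", errors="replace"):
    """Yield decoded lines from path without raising UnicodeDecodeError.

    encoding and errors are passed straight to the built-in open();
    the defaults are assumptions, not the file's confirmed encoding.
    """
    with open(path, "r", encoding=encoding, errors=errors) as f:
        for line in f:
            yield line
```

Called as `read_lines('sorted.de.word.unigrams')`, this would iterate over the whole file. Another option is `encoding="latin-1"`, which maps every possible byte to a character and therefore never fails, although the decoded text is only correct if the file really is Latin-1.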
You can download the packed file here:
http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=frequency_lists:sorted.de.word.unigrams.7z
Thanks a lot!