81

At the moment, I am trying to get a Python 3 program to do some manipulations with a text file filled with information, through the Spyder IDE/GUI. However, when trying to read the file I get the following error:

  File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
    parser(f)

  File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
    data = infile.read()

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

The code of the program is as follows:

import os

os.getcwd()

import glob
import re
import sqlite3
import csv

def parser(file):

    # Open a TXT file. Store all articles in a list. Each article is an item
    # of the list. Split articles based on the location of such string as
    # 'Document PRN0000020080617e46h00461'

    articles = []
    with open(file, 'r') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n   ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes. Extract
    # content of each used field code, and write to a CSV file.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n   ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n   ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n   ' + f + r'\n', a)
            f_pos = [f, f_m.start(), f_m.end()]
            fields_pos.append(f_pos)
        obs = []
        n = len(used)
        for i in range(0, n):
            used_f = fields_pos[i][0]
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        for f in unused:
            obs.insert(f[0], '')
        obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
        # print(obs)
        cur.execute('''INSERT INTO articles
                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                       co, ina, ns, re, ipc, ipd, pub, an)
                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)

# Write to SQLITE
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes, except that 'IN' becomes 'ina' because IN is a reserved SQL keyword
    cur.execute('''CREATE TABLE articles
                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                   et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                   ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)
halfer
user3443027

6 Answers

136

As you can see from https://en.wikipedia.org/wiki/Windows-1252, the byte 0x9D is not defined in CP1252.

The "error" is in your open call: you do not specify the encoding, so Python (on Windows) falls back to the system encoding, here cp1252. In general, if you read a file that may not have been created on the same machine, it is much better to specify the encoding explicitly.

I also recommend specifying an encoding on the open call that writes the CSV. It is better to be explicit.
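A minimal sketch of writing a CSV with an explicit encoding, assuming UTF-8 is acceptable to whatever reads the file afterwards. The file name and rows are hypothetical placeholders, not the asker's data.

```python
import csv
import os
import tempfile

# Hypothetical rows; the second one contains a non-ASCII character (U+201D).
rows = [['id', 'hd'], ['1', 'Quarterly results \u201d']]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.csv')

    # Explicit encoding on write, so the result does not depend on the OS default.
    with open(path, 'w', newline='', encoding='utf-8') as csvfile:
        csv.writer(csvfile).writerows(rows)

    # Reading back with the same encoding recovers the original text.
    with open(path, 'r', newline='', encoding='utf-8') as csvfile:
        read_back = list(csv.reader(csvfile))
```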

I do not know the original file format, but adding encoding='utf-8' to open is usually a good choice (and it is the default on Linux and macOS).
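To illustrate why the byte 0x9d trips cp1252, here is a small sketch. It assumes the input is actually UTF-8, where the right double quotation mark U+201D is stored as the bytes E2 80 9D. The sample text and file name are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical sample: U+201D encodes to E2 80 9D in UTF-8.
sample = 'He said \u201chello\u201d'
raw = sample.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'wb') as out:
        out.write(raw)

    # Reading as cp1252 fails, because 0x9d has no mapping in that code page.
    try:
        with open(path, 'r', encoding='cp1252') as infile:
            infile.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Reading with the encoding the file was written in succeeds.
    with open(path, 'r', encoding='utf-8') as infile:
        data = infile.read()
```

If the file turns out not to be UTF-8 either, this call raises its own UnicodeDecodeError, which is a sign that the real encoding still has to be identified.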

Giacomo Catenazzi
  • 12
    [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers) – Roman Dec 31 '18 at 11:18
  • I use WSL with Windows. My Python script works fine on Linux but does not work on Windows. How can I find out which encoding Linux uses, so I can use it on Windows? (utf-8 doesn't work.) – Sahin Jul 06 '21 at 11:02
  • Linux uses UTF-8 (unless you are using an old distribution that was never updated). "Does not work on Windows" is not something we can help with: it is too generic. Common problems: you are using `print` to a shell/console/terminal that is not set up for UTF-8, or you are mixing encodings (some inputs may be in the system encoding). You will find many answers (on this site) about Windows encoding problems. You just need to understand the problem in more detail than "does not work". – Giacomo Catenazzi Jul 06 '21 at 12:36
  • Adding encoding='utf-8' fixed the issue. – AndroDev Aug 20 '22 at 14:34
85

Add the encoding to the open statement. For example:

f=open("filename.txt","r",encoding='utf-8')
Martin Kinuthia
22

The above did not work for me. Try adding , errors='ignore' to the open call instead. Worked wonders!
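For context, a short sketch of what errors='ignore' does to the data, assuming the bytes are really UTF-8 but are decoded as cp1252. The sample string is hypothetical. The undecodable byte is dropped silently, while errors='replace' at least leaves a visible U+FFFD marker.

```python
# Hypothetical UTF-8 bytes containing an accented letter and curly quotes.
raw = 'caf\u00e9 \u201cquoted\u201d'.encode('utf-8')

# Wrong codec plus 'ignore': no exception, but the 0x9d byte disappears
# and the remaining non-ASCII bytes come out as the wrong characters.
ignored = raw.decode('cp1252', errors='ignore')

# Wrong codec plus 'replace': the undecodable byte becomes U+FFFD.
replaced = raw.decode('cp1252', errors='replace')

# Right codec, strict error handling: the text comes back intact.
correct = raw.decode('utf-8')
```

Neither error handler recovers the original text here; only decoding with the matching encoding does.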

Romano
  • 7
    using both encoding='utf-8' and errors='ignore' would make more sense – Eswar Apr 07 '19 at 06:49
  • 2
    Hiding the error is usually the wrong thing to do. This only makes sense in unusual circumstances, but more commonly is used in desperation by people who don't understand encoding. Now would be a good time to finally read [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) – tripleee Sep 23 '21 at 15:56
18

You can also try file = open(filename, 'rb'). The 'rb' mode means "read binary", which is suitable if you do not need to decode the content, for example if you just want to upload the file to a website.
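A minimal sketch of reading in binary mode, assuming the content only needs to be passed along as bytes. The payload and file name are hypothetical.

```python
import os
import tempfile

# Hypothetical payload containing the byte 0x9d.
payload = b'caf\xc3\xa9 \xe2\x80\x9d'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'upload.bin')
    with open(path, 'wb') as out:
        out.write(payload)

    # Binary mode returns bytes and performs no decoding,
    # so no UnicodeDecodeError can occur at this step.
    with open(path, 'rb') as infile:
        data = infile.read()
```

Text operations such as the regular expressions in the question would still need a str, so this only helps when the bytes are used as-is.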

Nnaobi
6

errors='ignore' solved my headache when searching for the word "coma" in the text files of a directory and its subdirectories:

import os
rootdir=('K:\\0\\000.THU.EEG.nedc_tuh_eeg\\000edf.01_tcp_ar\\01_tcp_ar\\')
for folder, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith('.txt'):
            fullpath = os.path.join(folder, file)
            with open(fullpath, 'r', errors='ignore') as f:
                for line in f:
                    if "coma" in line:
                        print(fullpath)
                        break
pkanda
0

I don't believe that using errors='ignore' is a good idea, even if it works. Since you don't know what else might be ignored, you should look for a way around the problem that doesn't cut off pieces of the file.

I had this problem too when I tried to append HTML as text to a file. You can try what I did: first get the content as bytes, then convert it to a string by decoding it as 'utf-8':

converted_file = binary_file.decode('utf-8')
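A hedged sketch of the same bytes-then-decode idea with strict error handling, so that a wrong encoding is reported rather than hidden. The helper name decode_or_explain and the sample HTML are assumptions made for illustration; binary_file stands in for whatever bytes were fetched.

```python
def decode_or_explain(raw: bytes, encoding: str = 'utf-8') -> str:
    """Decode strictly; on failure, report which bytes were rejected and where."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        bad = raw[exc.start:exc.end]
        raise ValueError(
            f'{encoding} cannot decode bytes {bad!r} at offset {exc.start}'
        ) from exc


# Hypothetical content held as bytes, e.g. HTML fetched from somewhere.
binary_file = '<p>\u201cquoted\u201d</p>'.encode('utf-8')
converted_file = decode_or_explain(binary_file)
```

Knowing the offset of the rejected bytes makes it easier to inspect the file and work out which encoding it really uses.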