I have an input CSV file. When I try to run some operations on it and write an output file, I get this error.

At first I got the 'utf-8' error, so I searched and checked the encoding of my file with this:

import chardet
with open('1out_test.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
result

Output: {'confidence': 1.0, 'encoding': 'ascii'}

Then I wrote the following:

import re
import pandas as pd

WORDS, N = ["aaaa", "tttt"], 1

pattern = (
    rf"((?:\S+ +){{0,{N}}}\S*"
    fr"\b(?:{'|'.join(map(re.escape, WORDS))})\b"
    rf"\S*(?: +\S+){{0,{N}}})"
)

pd.read_csv("1out_test.csv", encoding='ascii', low_memory=False).assign(info=lambda x: x["remarks"].str.extract(pattern,flags= re.IGNORECASE, expand=False).fillna("NA")).to_csv("output.csv", index=False)

This again gave me the same error, but with 'ascii': 'ascii' codec can't decode byte 0xe2 in position 31: ordinal not in range(128)

NOTE: In both errors, the position (31) was the same.
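
One possible explanation, which is an assumption and not something confirmed in this thread, is that chardet only looked at the first 100000 bytes and the 0xe2 byte sits further into the file, with the reported position being relative to a parser buffer instead of the start of the file. A minimal sketch to check this by scanning the whole file for bytes outside the ASCII range (the helper name `find_non_ascii` is made up for illustration):

```python
def find_non_ascii(path, limit=5, chunk_size=1 << 20):
    """Return up to `limit` (file_offset, byte_value) pairs for bytes >= 0x80."""
    hits = []
    offset = 0  # absolute offset of the current chunk within the file
    with open(path, "rb") as fh:
        while len(hits) < limit:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            for i, b in enumerate(chunk):
                if b >= 0x80:  # anything above 0x7f is not ASCII
                    hits.append((offset + i, b))
                    if len(hits) == limit:
                        break
            offset += len(chunk)
    return hits

# Usage (file name taken from the question):
#   for pos, b in find_non_ascii("1out_test.csv"):
#       print(pos, hex(b))
```

If the first hit is beyond offset 100000, the 'ascii' result from chardet only describes the sample it was given, not the whole file.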

Fanatic
  • Does this answer your question? [UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)](https://stackoverflow.com/questions/18649512/unicodedecodeerror-ascii-codec-cant-decode-byte-0xe2-in-position-13-ordinal) – Harun Yilmaz Aug 15 '23 at 13:39
  • 0xe2 is `â` in ISO-8859-1, so if your CSV file contains that character, ISO-8859-1 is the encoding you are looking for. – treuss Aug 15 '23 at 13:47
  • @treuss Or ISO-8859-15. – Matthias Aug 15 '23 at 13:53
  • I have tried a similar approach; the encoding is 'ascii' for my input file, but it still shows the error. The top answer in the linked post takes an approach that I don't think is applicable to my situation. – Fanatic Aug 15 '23 at 13:58
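
To illustrate the comments above, here is a small sketch of how the single byte 0xe2 reads under the encodings mentioned, and why UTF-8 rejects it when the next byte is not a continuation byte. The sample byte strings are made up for the demonstration and are not taken from the asker's file; cp1252 is added only for comparison.

```python
raw = b"\xe2"

# In these single-byte encodings, 0xe2 maps directly to one character.
single_byte_views = {
    enc: raw.decode(enc) for enc in ("iso-8859-1", "iso-8859-15", "cp1252")
}
print(single_byte_views)

# In UTF-8, 0xe2 starts a three-byte sequence, for example a curly apostrophe.
curly = b"\xe2\x80\x99".decode("utf-8")
print(repr(curly))

# If 0xe2 is followed by an ordinary ASCII byte, UTF-8 decoding fails.
try:
    b"\xe2 text".decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason
    print(reason)
```

The "invalid continuation byte" reason matches the message reported for encoding='utf-8', which is consistent with the file using a single-byte encoding such as ISO-8859-1.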

1 Answer

Try replacing

pd.read_csv("1out_test.csv", encoding='ascii', low_memory=False).assign(info=lambda x: x["remarks"].str.extract(pattern,flags= re.IGNORECASE, expand=False).fillna("NA")).to_csv("output.csv", index=False)

with:

pd.read_csv("1out_test.csv", encoding='utf-8', low_memory=False).assign(info=lambda x: x["remarks"].str.extract(pattern,flags= re.IGNORECASE, expand=False).fillna("NA")).to_csv("output.csv", index=False)
Suley
  • Hey @Suley! I have already tried that; it gives me the error: 'utf-8' codec can't decode byte 0xe2 in position 31: invalid continuation byte – Fanatic Aug 15 '23 at 13:47
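
Since both 'ascii' and 'utf-8' fail on the same byte, one option is to try a short list of candidate encodings against the raw bytes and use the first one that decodes cleanly. This is only a sketch: the helper name `pick_encoding` and the candidate list are assumptions, and a clean decode does not prove the encoding is the right one ('latin-1' accepts every byte, so it is just a last resort).

```python
def pick_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file."""
    with open(path, "rb") as fh:
        raw = fh.read()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next
        return enc
    raise ValueError(f"none of {candidates!r} could decode {path!r}")

# Usage (file name taken from the question):
#   enc = pick_encoding("1out_test.csv")
#   df = pd.read_csv("1out_test.csv", encoding=enc, low_memory=False)
```

The returned name can be passed as the `encoding` argument of `pd.read_csv`. It is worth checking the decoded "remarks" values by eye afterwards, because several single-byte encodings decode the same bytes without raising an error.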