from os import listdir
from os.path import isfile, join

import pandas as pd

# Folder holding the log files for the specific region
region_log_filepath = join(log_files_folder_path, region)

# Full paths of every regular file in that folder
files = [join(region_log_filepath, file) for file in listdir(region_log_filepath) if isfile(join(region_log_filepath, file))]

for file in files:
    if file.endswith('csv'):
        # Keep only the part of the name between 'Log-' and '.csv'
        filename = file.split('Log-')[-1].split('.csv')[0]
        print(f'\nreading file: {filename}')
        log_file = pd.read_csv(file, encoding='unicode_escape')

The above code gives this error: UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 26850-26851: truncated \UXXXXXXXX escape

I tried looking it up and found a post suggesting converting it to a raw string. How would I add r' to file in the pd.read_csv() function?

Soumya Pandey
  • There are many encodings to choose from: https://docs.python.org/3/library/codecs.html#standard-encodings but I don't know what format your file is actually in. Maybe try 'utf_8' instead of 'unicode_escape'? – qrsngky Jun 20 '22 at 10:57
  • Does this answer your question? - https://stackoverflow.com/questions/1347791/unicode-error-unicodeescape-codec-cant-decode-bytes-cannot-open-text-file – Mortz Jun 20 '22 at 10:59
  • @qrsngky They are CSV files – Soumya Pandey Jun 20 '22 at 11:11

2 Answers

You can pass the encoding parameter to read_csv() so that it matches the encoding the file was actually saved in.

Try one of these:

"utf-8",
"ISO-8859-1",
"latin",
"cp1252"

In Python, "latin" is an alias for "ISO-8859-1", so those two behave the same.

Syntax: read_csv(file, encoding="utf-8")

See https://docs.python.org/3/library/codecs.html#standard-encodings for the full list.
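
Trying the candidates one by one can be scripted. The following is a minimal sketch, not a definitive fix: `first_working_encoding` is a hypothetical helper name, and the candidate list is an assumption you should adapt to your data. It only checks whether the bytes decode without error, which does not prove the encoding is the right one ("latin-1" accepts every byte, so it always succeeds and is best kept last).

```python
from pathlib import Path


def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper: a successful decode is a necessary condition,
    not proof that the text will look right.
    """
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this codec rejects some byte; try the next one
        return enc
    raise ValueError(f"none of {candidates!r} could decode {path!r}")
```

You could then call `pd.read_csv(file, encoding=first_working_encoding(file))`. If some bytes are simply corrupt, pandas 1.3 and later also accept `encoding_errors="replace"` in read_csv(), which substitutes a placeholder character instead of raising.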

Himanshu Kawale
  • When I add utf-8 I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 15: invalid start byte – Soumya Pandey Jun 20 '22 at 11:20
  • It is because the data is not encoded in `utf-8`. Try using `encoding="cp1252"` – Himanshu Kawale Jun 20 '22 at 11:35
  • When I add cp1252 I get this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 22: character maps to <undefined> – Soumya Pandey Jun 20 '22 at 12:15
  • Please go through the link at the end of my answer. There are many encoding standards, and which one to use depends on the data; the list at that link should help you find the one that matches your data format. – Himanshu Kawale Jun 20 '22 at 13:24

> I tried looking it up and found a post suggesting to convert it to a raw string. How would I add r' to file in the pd.read_csv() function ?

Then you only need to use the 'utf-8' encoding in read_csv(). It will not interpret those Unicode-like escape sequences; it treats them as the literal characters they are.
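
To illustrate the difference, here is a small sketch using made-up sample bytes (the `raw` content is an assumption, not taken from the asker's files). An r'' prefix only affects string literals written in source code, so it cannot be applied to data read from a file; what matters is which codec decodes the bytes. With "utf-8" a backslash followed by U stays as two ordinary characters, while "unicode_escape" tries to parse it as an escape and fails.

```python
# Bytes as they might sit in a CSV file: a Windows path containing "\U".
raw = b"id,path\n1,C:\\Users\\logs\n"

# utf-8 leaves the backslashes alone.
text = raw.decode("utf-8")
print(text)

# unicode_escape treats "\U" as the start of a \UXXXXXXXX escape and raises.
try:
    raw.decode("unicode_escape")
    escape_error = None
except UnicodeDecodeError as exc:
    escape_error = exc

print(escape_error)
```

This only helps when the file really is UTF-8; bytes that are not valid UTF-8 still raise their own UnicodeDecodeError.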

  • When I add utf-8 I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 15: invalid start byte – Soumya Pandey Jun 20 '22 at 11:26
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jun 20 '22 at 17:16