
I am parsing a large number (~90,000) of CSV files. Some of the files were converted to text from PDF, so they contain a lot of noise in the form of garbled characters, for example "Cachï¿". Some of these files were converted online and some through pdfminer. In my program I parse the files and remove the stop words:

cleanedRow = ' '.join([word for word in row[1].split() if word not in stopWrds])

But because of these encoding/decoding issues, my program fails. I cannot delete all such characters by searching through 90,000 files by hand. The program throws the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

Is there an elegant way to ignore these characters in Python? I would appreciate any help. Thanks.
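For reference, here is a minimal sketch of the kind of workaround I have in mind: open each file with an explicit encoding and errors="ignore" so undecodable bytes are silently dropped instead of raising UnicodeDecodeError. Assumes Python 3; STOP_WORDS and the file path are placeholders, not my real data.

```python
import csv

STOP_WORDS = {"the", "a", "and"}  # placeholder stop-word set


def clean_row_text(text, stop_words):
    """Drop stop words from a whitespace-split string."""
    return " ".join(w for w in text.split() if w not in stop_words)


def parse_file(path, stop_words):
    # errors="ignore" silently drops byte sequences that are not valid
    # UTF-8, so stray bytes like 0xc3 no longer raise UnicodeDecodeError.
    with open(path, encoding="utf-8", errors="ignore", newline="") as fh:
        for row in csv.reader(fh):
            if len(row) > 1:
                yield clean_row_text(row[1], stop_words)
```

Using errors="replace" instead would substitute U+FFFD for bad bytes rather than dropping them, which keeps a visible marker of where the noise was.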

CodeSsscala

0 Answers