
I'm trying to combine multiple CSV files into one with this code:

import glob
import pandas as pd

path = r'/content/drive/My Drive/DatiAirQuality/MI_Air_Quality/data'
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

but I get this error: 'utf-8' codec can't decode byte 0xb5 in position 0: invalid start byte

and here is the traceback:

      8 for filename in all_files:
----> 9     df = pd.read_csv(filename, index_col=None, header=0)
     10     li.append(df)
     11

Thank you.

ydrall
Dimi
  • It looks like your file is not `utf-8`. You should find out in which encoding it was saved and decode it. Or perhaps it is not a text file at all... – zvone Mar 09 '19 at 10:47
  • in fact it's not a text file, it contains only numerical data. – Dimi Mar 09 '19 at 10:52
  • We can't tell you the correct encoding without seeing (a representative, ideally small sample of) the actual contents of the data in an unambiguous representation; a hex dump of the problematic byte(s) with a few bytes of context on each side is often enough, especially if you can tell us what you think those bytes are supposed to represent. See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Mar 09 '19 at 12:30
  • https://tripleee.github.io/8bit/#b5 shows 25 possible interpretations of this byte value in different 8-bit encodings, but none of them look particularly probable or useful. – tripleee Mar 09 '19 at 12:43
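
For reference, a minimal sketch of how such a hex dump could be produced in Python (the filename below is a placeholder for one of the files in all_files):

with open("some_file.csv", "rb") as f:   # "some_file.csv" is a placeholder
    raw = f.read(32)                     # the first few bytes are enough context
print(raw.hex())                         # the raw bytes as hex digits, e.g. "b5..."
print(raw)                               # the same bytes as a Python bytes literal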

5 Answers

7

Try specifying this:

df = pd.read_csv(filename, index_col=None, header=0, encoding='latin-1')

The latin-1 encoding is magical - it never fails to decode, because every possible byte value maps to some character. See what you get. If this is good enough - well there you go.

If not, you'll have to find out what encoding the CSV files actually use. You could just try lots of different encodings until the answer seems OK.
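
For example, a rough sketch of that trial-and-error approach applied to the loop from the question (the candidate list below is just a guess, and this only checks that decoding succeeds, not that the result makes sense):

import glob
import pandas as pd

candidate_encodings = ['utf-8', 'utf-8-sig', 'cp1252', 'latin-1']  # guesses; extend as needed

def read_csv_any_encoding(filename):
    # Return the first DataFrame that parses without a decoding error.
    for enc in candidate_encodings:
        try:
            return pd.read_csv(filename, index_col=None, header=0, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f'could not decode {filename} with any of {candidate_encodings}')

li = [read_csv_any_encoding(f) for f in glob.glob(path + "/*.csv")]  # `path` as in the question
frame = pd.concat(li, axis=0, ignore_index=True)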

  • The problem with this is that it looks like it succeeds even if the results are completely bogus. – tripleee Mar 09 '19 at 12:33
  • @tripleee you're correct, perhaps I wasn't clear enough in how I described my suggestion. It's supposed to be a work-around, not a true solution – Yoav Kleinberger Mar 15 '19 at 18:30
2

This worked for me:

pd.read_csv(filename, encoding='unicode_escape')
Masih
1

I'd try:

pd.read_csv(filename, index_col=None, header=0, encoding='utf-8') #OR
pd.read_csv(filename, index_col=None, header=0, encoding='latin1')
razimbres
0

First you need to know which encoding your CSV files actually use. You can try Chardet: The Universal Character Encoding Detector to guess it. Chardet can be installed with:

pip install chardet

After installing chardet you can use its command-line tool, chardetect, to guess your CSV file's encoding:

chardetect file_name.csv

The output will be something like this:

file_name.csv: UTF-8-SIG with confidence 1.0

Then change the following line in your code:

df = pd.read_csv(filename, index_col=None, header=0)

to the following, passing the encoding that chardet reported (here 'utf-8-sig', from the example output above):

df = pd.read_csv(filename, index_col=None, header=0, encoding='utf-8-sig')

You can check the list of standard encodings supported by Python in the codecs documentation. Hopefully this solves your issue.
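
If you prefer to do the detection from Python rather than the command line, here is a rough sketch using chardet's detect() function (it reuses `path` and the loop from the question, and the detection is only a guess):

import glob
import chardet
import pandas as pd

li = []
for filename in glob.glob(path + "/*.csv"):
    with open(filename, "rb") as f:
        raw = f.read(100000)                  # a sample of the file is usually enough
    guess = chardet.detect(raw)               # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    df = pd.read_csv(filename, index_col=None, header=0, encoding=guess["encoding"])
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)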

Chris Henry
  • the open() function does not figure out the file's encoding - it uses the default encoding configured for Python, which in this case is utf-8. *any* file you open will say 'utf-8', but if it's not true, there will be an exception once you try to read the file. Try it out - open some binary file like that and see what happens. – Yoav Kleinberger Mar 09 '19 at 11:24
  • @YoavKleinberger Thanks for the information. I edited it by adding a way to predict the encoding of the CSV file. – Chris Henry Mar 09 '19 at 11:37
  • `chardet` is not entirely reliable, it uses heuristics and doesn't examine the entire input file. – tripleee Mar 09 '19 at 12:04
  • @tripleee Agreed. It can just PREDICT with a confidence. So I do not expect it to be 100% accurate. Other than using chardet, we should check all possible encodings manually until we find one that works. I would love to know a robust solution in case you have one. – Chris Henry Mar 09 '19 at 12:08
  • There is no way to know the encoding unless you also know what it is supposed to represent. See e.g. https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text – tripleee Mar 09 '19 at 12:21
  • https://stackoverflow.com/questions/27832597/detect-actual-charset-encoding-in-utf has an answer of mine which implements the looping you propose, though in a slightly different context. – tripleee Mar 09 '19 at 12:24
  • UTF-8 was already ruled out by the error message the OP got. Googling for duplicates gets me all kinds of improbable guesses, among them GB2312 (a Chinese multibyte encoding). – tripleee Mar 09 '19 at 12:45
0

As I can see, there are already many answers about passing an encoding to pandas.

Here is an alternate approach:

with open(file_source, encoding="utf8", errors='ignore') as file:
    # Your code goes here
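
For readers not using pandas, a minimal sketch of what could go inside that with block (it reuses the answer's file_source placeholder and assumes plain comma-separated rows):

import csv

rows = []
with open(file_source, encoding='utf8', errors='ignore') as file:
    reader = csv.reader(file)
    for row in reader:          # each row is a list of strings
        rows.append(row)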
Farid Chowdhury
  • I don't think trying to open as utf-8 and ignoring any errors is a good way to do it, as it might not be utf-8 at all. Also, OP wanted to use pandas. – Midiparse Apr 12 '20 at 21:56
  • Yes, you are correct. I have now modified my answer to present it as an alternate solution. This might be helpful for others who are not using pandas. – Farid Chowdhury Apr 12 '20 at 23:29