I have the code below, and it fails with this error:

'utf-8' codec can't decode byte 0x92 in position 747: invalid start byte

import glob
import os

import pandas as pd
import tabula

# convert every PDF in the folder to CSV
tabula.convert_into_by_batch("C:\\PATH", output_format="csv", pages="all")
files = os.path.join("C:\\PATH", "*.csv")
files = glob.glob(files)
df = []
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
  • Welcome to Stack Overflow! Have you seen these related posts [1](https://stackoverflow.com/q/29419322/1389394) and [2](https://stackoverflow.com/q/55857074/1389394) for a start? Why do you use tabula.convert_into_by_batch here? There is no purpose for *df=[]* either. – bonCodigo May 25 '22 at 14:13
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community May 25 '22 at 14:43
  • The error suggests that your file isn't encoded as `UTF-8` but as something else, e.g. `latin1` or `cp1250`, so you may need `read_csv(..., encoding='latin1')` – furas May 25 '22 at 16:15
  • code `b'\x92'.decode('cp1250')` gives me char `’` – furas May 25 '22 at 16:20

1 Answer

The error suggests that the file isn't encoded as UTF-8 but as something else, e.g. Latin1 or CP1250, so you may need read_csv(..., encoding='latin1')
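To see where the message comes from, here is a minimal sketch that reproduces it without any CSV file. The sample text `It’s` is only an assumption about what the file might contain; the real byte at position 747 could belong to any word with a curly apostrophe.

```python
# Assumed sample text: a curly apostrophe saved by a Windows tool as cp1252.
raw = "It’s".encode("cp1252")  # b'It\x92s'

# Decoding as UTF-8 fails, because 0x92 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

# The same bytes decoded with three single-byte encodings.
decoded = {enc: raw.decode(enc) for enc in ("cp1252", "cp1250", "latin1")}

print(utf8_error)
for enc, text in decoded.items():
    print(enc, repr(text))
```

Note that `latin1` does not raise an error here, but it turns 0x92 into the control character U+0092 instead of `’`, so `cp1252` or `cp1250` may give cleaner text if the files come from Windows.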

If all files use the same encoding, e.g. latin1, then you may need

map(lambda name:pd.read_csv(name, encoding="latin1"), files)

but if the files may use different encodings, then you may need a normal for-loop that runs read_csv() inside try/except, catches the decoding error, and runs read_csv() again with a different encoding.

Something like this:

# --- before loop ---

all_df = []

# --- loop ---

for name in files:
    # `latin1` maps every possible byte, so it never fails; keep it last as a fallback
    for encoding in ['utf-8', 'cp1250', 'latin1']:
        try:
            df = pd.read_csv(name, encoding=encoding)
            all_df.append(df)
            break
        except UnicodeDecodeError:
            pass  # wrong encoding, try the next one
    else:  # `for/else`: runs only if the loop finished without `break`
        print("I can't read file:", name)

# --- after loop ---

df = pd.concat(all_df, ignore_index=True)
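A possible variation, if parsing the same CSV several times is too slow: check the raw bytes first and only then call read_csv() once. This is just a sketch; `guess_encoding` is a hypothetical helper name (not part of pandas), and the candidate list is an assumption that you should adjust to your files.

```python
from pathlib import Path


def guess_encoding(path, candidates=("utf-8", "cp1250", "latin1")):
    """Return the first candidate that decodes the whole file, or None.

    Hypothetical helper: it only proves that decoding does not fail,
    not that the resulting text is the one the author intended.
    """
    data = Path(path).read_bytes()
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
        return encoding
    return None
```

It could then be used as `pd.read_csv(name, encoding=guess_encoding(name))`. If it returns `None`, pandas falls back to its default (UTF-8), so you may want to check for that case and skip the file instead.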
furas