
I have a problem reading in a CSV with an id field that has mixed dtypes in the original source data, i.e. the id field can be 11, 2R399004, BL327838, 7 etc., but the vast majority of the values are 8 characters long.

When I read it with multiple versions of pd.read_csv and encoding='iso-8859-1', it always converts the 7 and 11 to 00000007 or the like. I've tried using utf-8, but I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 40: unexpected end of data

I have tried setting dtype={'field': object} and dtype={'field': str}, as well as various iterations of latin-1 and the like, but it continually does this.
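
Roughly what I'm calling, with the file path and column name changed to placeholders:

    import pandas as pd

    # 'source_file.csv' and 'field' stand in for the real file and column names
    df = pd.read_csv(
        'source_file.csv',
        encoding='iso-8859-1',   # utf-8 raises the UnicodeDecodeError above
        dtype={'field': str},    # also tried object; the short IDs still come back zero-padded
    )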

Is there any way to get around this error, without going through every individual file and fixing the dtypes?

    What is the encoding of the files? What are the dtypes of the field, if you import it with `'iso-8859-1'`? When you say "multiple versions of pd.read_csv," what do you mean? – Evan Feb 07 '18 at 18:25
  • Have you tried this? https://stackoverflow.com/questions/13142347/how-to-remove-leading-and-trailing-zeros-in-a-string-python – Evan Feb 07 '18 at 18:30
  • Can we see some sample records from your csv file? – Bill Bell Feb 07 '18 at 19:49

1 Answer


Basically the column looks like this:

    Column_ID
    10
    HGF6558
    059
    KP257
    0001
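
A minimal sketch of one way to handle a column like this (the file name is assumed, and whether to strip zeros depends on whether the padding is genuinely part of the IDs):

    import pandas as pd

    # force the whole column to strings so pandas never converts the values
    df = pd.read_csv('data.csv', encoding='iso-8859-1', dtype={'Column_ID': str})

    # if the leading zeros were added somewhere upstream and are not part of the
    # real IDs, they can be stripped afterwards (note this also removes genuine
    # leading zeros, e.g. '059' becomes '59'):
    df['Column_ID'] = df['Column_ID'].str.lstrip('0')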