
I am trying to convert the df to a list of dictionaries but am getting a Unicode error:

Code:

from functools import reduce

import pandas as pd

# distinct_col, csv_filepath, header_count, read_col, csv_col_name and
# footer_count are defined earlier in the script.
# Read the file in small chunks and fold them together, dropping duplicates.
df = reduce(lambda df_i, df_j: pd.concat([df_i, df_j]).drop_duplicates(subset=distinct_col),
            pd.read_csv(csv_filepath,
                        encoding='latin1',
                        engine='python',
                        skipinitialspace=True,
                        skiprows=header_count,
                        usecols=read_col,
                        iterator=True,
                        header=None,
                        names=csv_col_name,
                        chunksize=2,
                        sep=r'\s*;\s*',  # raw string avoids invalid-escape warnings
                        dtype=str))
                        
# Drop the trailing footer rows.
df.drop(df.tail(footer_count).index, inplace=True)
# The error is raised while converting to a list of dictionaries.
csv_records = df.to_dict('records')

Errors:

  1. With encoding='latin1', the following error occurs: UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 85: ordinal not in range(128) (at line 171)
  2. With encoding='utf-8', the following error occurs: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 860: invalid start byte (at line 159)
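
Note that the two errors point in different directions: the 'utf-8' failure is a decode error raised while reading the file, whereas the 'latin1' failure is an encode error, which typically happens later, when already-decoded text (here '\xfc', i.e. ü in Latin-1) is written to an ASCII-configured stream. As a minimal diagnostic sketch (not part of the original code), one can check the interpreter's I/O encoding on the Unix server:

import sys
import locale

# Under the POSIX/C locale these often report 'ascii' (or 'ANSI_X3.4-1968'),
# in which case any print()/write() of a string containing 'ü' raises
# UnicodeEncodeError: 'ascii' codec can't encode character '\xfc'.
print(sys.stdout.encoding)
print(locale.getpreferredencoding())

If they do report ASCII, exporting PYTHONIOENCODING=utf-8 (or a UTF-8 locale such as LANG=en_US.UTF-8) before running the script is a common fix, and the read_csv call can keep encoding='latin1'.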

Please suggest how we can resolve this issue. Thanks in advance :)

  • On Windows 'latin-1' works fine, but on the Unix server neither 'latin-1' nor 'utf-8' works; both give the errors above. – user14270903 Sep 13 '20 at 17:52
  • We can’t tell you the correct encoding without seeing (a representative, ideally small sample of) the actual contents of the data in an unambiguous representation; a hex dump of the problematic byte(s) with a few bytes of context on each side is often enough, especially if you can tell us what you think those bytes are supposed to represent. See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Sep 13 '20 at 17:59
  • Try `chardet file_name` to detect the file encoding – Sergey Bushmanov Sep 13 '20 at 17:59
  • Sample data: `EU ;Eschborn ;0835- 123456-21-1 ;0005;083504419692; ;300;ABGLEICHKONTO EUREX ; 0.00000; 0.00000;20200423 ;CHF ;0010 ;` – user14270903 Sep 13 '20 at 18:20
  • @Sergey the chardet command is not working on Unix; it says command not found – user14270903 Sep 13 '20 at 18:21
  • I have run the command `file -i *`, which gave the following output: `Businessunit1.DAT: text/plain; charset=iso-8859-1` – user14270903 Sep 13 '20 at 18:30
  • We still can't tell you for sure whether that's correct without seeing the actual bytes. The `file` command only examines a small portion of the beginning of the file so its diagnostics are approximate at best. Switching to `'latin-1'` (aka ISO 8859-1) will definitely remove the second error, but might produce garbage. The first error - `'ascii' codec can't encode` - is still impossible to diagnose without a proper traceback, as far as I can tell. – tripleee Sep 14 '20 at 05:49
  • The "sample data" you provided in a comment is all ASCII so it's impossible from that to deduce anything about the encoding (and seeing just the text doesn't allow us to infer anything about the actual underlying bytes anyway). Did you read the meta post I linked above? – tripleee Sep 14 '20 at 05:52
  • Your German data almost certainly contains äöü and the question is how those are represented. If the `\xfc` byte represents ü then Latin-1 is a good guess, though perhaps see https://tripleee.github.io/8bit/#fc – tripleee Sep 14 '20 at 13:38
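
Following up on the chardet suggestion and the \xfc observation in the comments above, here is a minimal sketch (assuming the file path is csv_filepath as in the question) that hex-dumps the bytes around the first 0xfc and asks the chardet Python module for a guess, sidestepping the missing command-line tool:

import chardet  # third-party: pip install chardet

with open(csv_filepath, 'rb') as f:
    raw = f.read()

# Show a few bytes of context around the first 0xfc byte,
# both as hex and decoded as Latin-1.
pos = raw.find(b'\xfc')
if pos != -1:
    context = raw[max(pos - 10, 0):pos + 10]
    print(context.hex(), '->', context.decode('latin-1'))

# chardet.detect returns e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
print(chardet.detect(raw))

If chardet agrees with file -i on ISO-8859-1 and the decoded context reads as German text with ü, then encoding='latin1' in read_csv is the right choice, and the remaining UnicodeEncodeError is an output-side problem, as sketched after the error list above.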

0 Answers