
I am trying to convert the df to a list of dictionaries but am getting a Unicode error:

Code:

from functools import reduce

import pandas as pd

# distinct_col, csv_filepath, header_count, read_col, csv_col_name and
# footer_count are defined earlier in the script.
# Read the file in small chunks and fold them together, dropping duplicates.
df = reduce(lambda df_i, df_j: pd.concat([df_i, df_j]).drop_duplicates(subset=distinct_col),
            pd.read_csv(csv_filepath,
                        encoding='latin1',
                        engine='python',
                        skipinitialspace=True,
                        skiprows=header_count,
                        usecols=read_col,
                        iterator=True,
                        header=None,
                        names=csv_col_name,
                        chunksize=2,
                        sep=r'\s*;\s*',  # raw string avoids invalid-escape warnings
                        dtype=str))
                        
# Drop the trailing footer rows.
df.drop(df.tail(footer_count).index, inplace=True)
# The error is raised while converting to a list of dictionaries.
csv_records = df.to_dict('records')

Errors:

  1. With encoding='latin1', the following error occurs: UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 85: ordinal not in range(128) (at line 171)
  2. With encoding='utf-8', the following error occurs: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 860: invalid start byte (at line 159)
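
Note that the two errors point in different directions: the 'utf-8' failure is a decode error raised while reading the file, whereas the 'latin1' failure is an encode error, which typically happens later, when already-decoded text (here '\xfc', i.e. ü in Latin-1) is written to an ASCII-configured stream. As a minimal diagnostic sketch (not part of the original code), one can check the interpreter's I/O encoding on the Unix server:

import sys
import locale

# Under the POSIX/C locale these often report 'ascii' (or 'ANSI_X3.4-1968'),
# in which case any print()/write() of a string containing 'ü' raises
# UnicodeEncodeError: 'ascii' codec can't encode character '\xfc'.
print(sys.stdout.encoding)
print(locale.getpreferredencoding())

If they do report ASCII, exporting PYTHONIOENCODING=utf-8 (or a UTF-8 locale such as LANG=en_US.UTF-8) before running the script is a common fix, and the read_csv call can keep encoding='latin1'.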

Please suggest how we can resolve this issue. Thanks in advance :)

  • On Windows 'latin-1' works fine, but on the Unix server neither 'latin-1' nor 'utf-8' works; both give the errors above. – user14270903 Sep 13 '20 at 17:52
  • We can’t tell you the correct encoding without seeing (a representative, ideally small sample of) the actual contents of the data in an unambiguous representation; a hex dump of the problematic byte(s) with a few bytes of context on each side is often enough, especially if you can tell us what you think those bytes are supposed to represent. See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Sep 13 '20 at 17:59
  • Try `chardet file_name` to detect the file encoding – Sergey Bushmanov Sep 13 '20 at 17:59
  • Sample data: `EU ;Eschborn ;0835- 123456-21-1 ;0005;083504419692; ;300;ABGLEICHKONTO EUREX ; 0.00000; 0.00000;20200423 ;CHF ;0010 ;` – user14270903 Sep 13 '20 at 18:20
  • @Sergey the chardet command is not working on Unix; it says command not found – user14270903 Sep 13 '20 at 18:21
  • I have run the command `file -i *`, which gave the following output: `Businessunit1.DAT: text/plain; charset=iso-8859-1` – user14270903 Sep 13 '20 at 18:30
  • We still can't tell you for sure whether that's correct without seeing the actual bytes. The `file` command only examines a small portion of the beginning of the file so its diagnostics are approximate at best. Switching to `'latin-1'` (aka ISO 8859-1) will definitely remove the second error, but might produce garbage. The first error - `'ascii' codec can't encode` - is still impossible to diagnose without a proper traceback, as far as I can tell. – tripleee Sep 14 '20 at 05:49
  • The "sample data" you provided in a comment is all ASCII so it's impossible from that to deduce anything about the encoding (and seeing just the text doesn't allow us to infer anything about the actual underlying bytes anyway). Did you read the meta post I linked above? – tripleee Sep 14 '20 at 05:52
  • Your German data almost certainly contains äöü and the question is how those are represented. If the `\xfc` byte represents ü then Latin-1 is a good guess, though perhaps see https://tripleee.github.io/8bit/#fc – tripleee Sep 14 '20 at 13:38
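
Following up on the chardet suggestion and the \xfc observation in the comments above, here is a minimal sketch (assuming the file path is csv_filepath as in the question) that hex-dumps the bytes around the first 0xfc and asks the chardet Python module for a guess, sidestepping the missing command-line tool:

import chardet  # third-party: pip install chardet

with open(csv_filepath, 'rb') as f:
    raw = f.read()

# Show a few bytes of context around the first 0xfc byte,
# both as hex and decoded as Latin-1.
pos = raw.find(b'\xfc')
if pos != -1:
    context = raw[max(pos - 10, 0):pos + 10]
    print(context.hex(), '->', context.decode('latin-1'))

# chardet.detect returns e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
print(chardet.detect(raw))

If chardet agrees with file -i on ISO-8859-1 and the decoded context reads as German text with ü, then encoding='latin1' in read_csv is the right choice, and the remaining UnicodeEncodeError is an output-side problem, as sketched after the error list above.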

0 Answers