
Solution:

See the answer below: the file was not encoded in CP1252 but in UTF-16. The working code is:

import pandas as pd

df = pd.read_csv('my_file.csv', sep='\t', header=1, encoding='utf-16')

It also works with encoding='utf-16-le'.
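
For reference, the only difference between the two codecs is how they treat the byte order mark. A minimal sketch (using the first bytes from the dump in the update below) shows that utf-16 consumes the BOM to pick the byte order, while utf-16-le keeps it as a leading U+FEFF character, which is harmless here because the first row is discarded by header=1:

# Illustration only: how 'utf-16' and 'utf-16-le' treat the byte order mark
raw = b'\xff\xfe"\x00D\x00u\x00'        # first bytes of file_T.csv (BOM + '"Du')

print(raw.decode('utf-16'))     # '"Du'        -> BOM consumed, sets byte order
print(raw.decode('utf-16-le'))  # '\ufeff"Du'  -> BOM kept as a U+FEFF character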


Update: output of the first 3 lines as bytes:

In : import itertools 
...:  print(list(itertools.islice(open('file_T.csv', 'rb'), 3)))

Out : [b'\xff\xfe"\x00D\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00 \x00a\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00\n', b'\x00"\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\n', b'\x00C\x00o\x00d\x00e\x00 \x00M\x00C\x00U\x00\t\x00I\x00m\x00m\x00a\x00t\x00r\x00i\x00c\x00u\x00l\x00a\x00t\x00i\x00o\x00n\x00\t\x00D\x00a\x00t\x00e\x00\t\x00h\x00e\x00u\x00r\x00e\x00\t\x00V\x00i\x00t\x00e\x00s\x00s\x00e\x00\t\x00L\x00a\x00t\x00i\x00t\x00u\x00d\x00e\x00\t\x00L\x00o\x00n\x00g\x00i\x00t\x00u\x00d\x00e\x00\t\x00T\x00y\x00p\x00e\x00\t\x00E\x00n\x00t\x00r\x00\xe9\x00e\x00\t\x00E\x00t\x00a\x00t\x00\n']
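The leading b'\xff\xfe' in that dump is what gives the encoding away. As a minimal sketch, assuming only the standard library, the BOM could be sniffed like this:

import codecs

# Read just the first few bytes and compare them against the common BOMs
with open('file_T.csv', 'rb') as f:
    head = f.read(4)

if head.startswith(codecs.BOM_UTF16_LE):    # b'\xff\xfe'
    print('UTF-16, little-endian')
elif head.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
    print('UTF-16, big-endian')
elif head.startswith(codecs.BOM_UTF8):      # b'\xef\xbb\xbf'
    print('UTF-8 with BOM')
else:
    print('no BOM found')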

I'm working with CSV files whose raw form looks like this:

[screenshot of the beginning of file_T]

The problem is that they have two features that cause trouble together:

  • the first row is not the header

  • there is an accent in the header "Entrée", which raises a UnicodeDecodeError if I don't specify the encoding cp1252

I'm using Python 3.X and pandas to deal with these files.

But when I try to read one of them with this code:

import pandas as pd 

df_T = pd.read_csv('file_T.csv', header=1, sep=';', encoding = 'cp1252')
print(df_T)

I get the following output (same with header=0): [screenshot of the read_csv error on file_T]

In order to read the CSV correctly, I need to:

  • get rid of the accent
  • and ignore / delete the first row (which I don't need anyway).

How can I achieve that?

PS: I know I could write a VBA program or something for this, but I'd rather not. I'm interested in handling it within my Python program, or in knowing for sure that it is not possible.

– ToddEmon
  • Are you sure this is an ASCII file? Those weird bytes look like a BOM mark – Panagiotis Kanavos Jul 10 '19 at 09:49
  • Please post the output of `import itertools; print(list(itertools.islice(open('file_T.csv', 'rb'), 3)))`. This will show us the first 3 lines of `file_T.csv` *as bytes* and thus help us reproduce the problem. – unutbu Jul 10 '19 at 09:50
  • Did you try an skiprows argument? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html ``` skiprows : list-like, int or callable, optional Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2]. ``` – s3nh Jul 10 '19 at 09:55
  • Update with itertools input. I tried `df_T = pd.read_csv('file_T.csv', skiprows=0, sep=';', encoding = 'cp1252')` and got the same results. Also tried with `skiprows=1` which gives me 'Unnamed:0' instead of _ÿþ"_, and with and without `header=0` or `header=1` but that doesn't change a thing. – ToddEmon Jul 10 '19 at 10:03

1 Answer


CP1252 is the plain old Latin codepage, which does support all Western European accents. There wouldn't be any garbled characters if the file was written in that codepage.

The image of the data you posted is just that - an image. It says nothing about the file's raw format. Is it a UTF8 file? UTF16? It's definitely not CP1252.

Neither UTF8 nor CP1252 would produce NaNs either. Any single-byte codepage would at least read the numeric digits, which means the file is saved in a multi-byte encoding.

The two strange characters at the start look like a Byte Order Mark. If you check Wikipedia's BOM entry you'll see that ÿþ is the BOM for UTF16LE.

Try using utf-16 or utf-16-le instead of cp1252.
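
A minimal sketch of that suggestion, assuming the tab separator and second-row header visible in the question's byte dump (this is essentially the call the OP's solution edit confirms):

import pandas as pd

# read file_T.csv as UTF-16, tab-separated, taking the second row as the header
df_T = pd.read_csv('file_T.csv', sep='\t', header=1, encoding='utf-16')
print(df_T.head())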

– Panagiotis Kanavos
  • Following instructions from an answer at [link](https://stackoverflow.com/questions/37177069/how-to-check-encoding-of-a-csv-file), I used `with open('file_T.csv') as f: print(f)` and got encoding='cp1252', which led me to believe the file was CP1252. `utf-16` and `utf-16-le` work for me, thank you! – ToddEmon Jul 10 '19 at 10:08