
Following up on an earlier question of mine: I have finally identified what is happening.

I have a CSV file that uses \t as the separator, and I read it with the following command:

df = pd.read_csv(r'C:\..\file.csv', sep='\t', encoding='unicode_escape')

The resulting DataFrame has a length of, for example, 800,000.

The problem is that the original file has around 1,400,000 lines. I also know where the issue occurs: one column (let's say columnA) has the following entry:

"HILFE FüR DIE Alten

Do you have any idea what is happening? When I delete that row, I get the correct number of lines (length). What is Python doing here?

PV8
  • Can you provide an example of the dataframe? It may be caused by `encoding`. Can you try to read the file without it? – talatccan Jan 09 '20 at 11:19
  • What is your question? I read it without that line and it works; the file consists of German words. – PV8 Jan 09 '20 at 11:57

1 Answer


According to the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):

sep : str, default ‘,’ Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.

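It may be an issue with the double-quote character: an unmatched " opens a quoted field, so the parser can treat the following lines as part of that field until it reaches the next quote. A regex separator tends to ignore quoting, so try this instead: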

df = pd.read_csv(r'C:\..\file.csv', sep='\\t', encoding='unicode_escape', engine='python')

or this:

df = pd.read_csv(r'C:\..\file.csv', sep=r'\t', encoding='unicode_escape')
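To make the quoting behaviour concrete, here is a minimal sketch on a made-up in-memory sample. The rows, and the second stray quote that closes the field, are assumptions and not the asker's data. It compares default quoting with `quoting=csv.QUOTE_NONE` (a documented `read_csv` option that is not part of the suggestions above) and with a regex separator.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: 5 data rows, with stray double quotes in rows 2 and 4.
sample = (
    'id\tcolumnA\n'
    '1\tfoo\n'
    '2\t"HILFE FüR DIE Alten\n'
    '3\tbar\n'
    '4\tbaz"\n'
    '5\tqux\n'
)

# Default quoting: everything between the two quotes becomes one field,
# so rows 2 to 4 collapse into a single row.
default_df = pd.read_csv(io.StringIO(sample), sep='\t')

# Treat the double quote as an ordinary character.
no_quote_df = pd.read_csv(io.StringIO(sample), sep='\t', quoting=csv.QUOTE_NONE)

# Regex separator with the Python engine, as in the suggestions above.
regex_df = pd.read_csv(io.StringIO(sample), sep=r'\t', engine='python')

print(len(default_df), len(no_quote_df), len(regex_df))
```

If the quotes in the real file are never meant to delimit fields, `quoting=csv.QUOTE_NONE` is probably the more direct option, since it keeps the faster C engine.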
  • Both work. Do you think the problem would also occur if I changed the separator to `|`? And can you explain it in more detail? – PV8 Jan 09 '20 at 12:11
  • In your example the separator string is interpreted as a regular expression. I am not sure about this behavior; probably \t is perceived as 2 characters. To define multiple separators you can supply an appropriate regex, something like this: ```df = pd.read_csv(r'C:\..\file.csv', sep='\\t|\\|', encoding='unicode_escape', engine='python')``` or a more readable one: ```df = pd.read_csv(r'C:\..\file.csv', sep='[\t|]', encoding='unicode_escape', engine='python')``` – Александр Немиров Jan 09 '20 at 13:56
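The character-class separator from the last comment can be tried on a small in-memory sample. This is only a sketch: the two-line input and the column names `a`, `b`, `c` are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical input that mixes tab and pipe as field separators.
mixed = 'a\tb|c\n1|2\t3\n'

# '[\t|]' is a regex character class matching either a tab or a pipe;
# regex separators need the Python parsing engine.
multi_df = pd.read_csv(io.StringIO(mixed), sep='[\t|]', engine='python')

print(multi_df)
```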