12

I've read this, this and this posts but despite I don't know why quotechar does not work at pd.read_csv() (Python 3, pandas 0.18.0 and 0.18.1). And how could I read a dataframe like this:

"column1","column2", "column3", "column4", "column5", "column6"
"AM", 7, "1", "SD", "SD", "CR"
"AM", 8, "1,2 ,3", "PR, SD,SD", "PR ; , SD,SD", "PR , ,, SD ,SD"
"AM", 1, "2", "SD", "SD", "SD"

I want the following result:

Out[116]: 
  column1  column2 column3    column4       column5        column6
0      AM        7       1         SD            SD             CR
1      AM        8  1,2 ,3  PR, SD,SD  PR ; , SD,SD  PR , ,, SD,SD
2      AM        1       2         SD            SD             SD

Thank you!!

Community
  • 1
  • 1
ragesz
  • 9,009
  • 20
  • 71
  • 88
  • pd.read_csv('test.csv',sep=',\s+') – backtrack May 06 '16 at 14:31
  • Thank you for your answer, but as I wrote, I read the linked post so I tried this solution too, but does not work. I get this error messages: `ValueError: Expected 6 fields in line 3, saw 10`. Does it work for you? – ragesz May 06 '16 at 14:43

2 Answers2

23

Pandas doc on separators in read_csv():

Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions, will force use of the python parsing engine and will ignore quotes in the data.

Try using this instead (sep by default set to a comma):

pd.read_csv(file, skipinitialspace = True, quotechar = '"')
ptrj
  • 5,152
  • 18
  • 31
  • 2
    The important bit for me was `skipinitialspace = True` which ignores the space between the delimiters. The 'Error' in OPs CSV File is the space between commas, which is - apparently - not typical for csv. – rwenz3l Jun 15 '18 at 08:58
  • That does not work for me, pandas==0.22.0. My separator is ";", does this make a difference? should it? – ricoms Oct 22 '18 at 17:55
  • @RicardoMS Have you tried it with `sep=';'`, that is `pd.read_csv(file, skipinitialspace=True, quotechar='"', sep=';')`? – ptrj Oct 26 '18 at 21:37
  • 1
    @ptrj thanks! I found the problem, I needed to add `escapechar='\\'` argument. My file was very "dirty". Thanks a lot for the answer. :) – ricoms Oct 30 '18 at 06:48
0

Another solution is to use a proper regular expression instead of the simple \s+. We need to find comma (,) which is not within quotation marks:

pd.read_csv(file, 
            sep=', (?=(?:"[^"]*?(?: [^"]*)*))|, (?=[^",]+(?:,|$))',
            engine='python')

The expression is taken from here.

yoonghm
  • 4,198
  • 1
  • 32
  • 48