pandas data with double quote

Question

I am trying to read a large dataset in .csv format which is update automatically using the pandas library. The problem is that in my data, the first row is a string without double quotation marks, and the other colums are strings with double quotation marks. It is not possible for me to adjust the .csv file manually.

A simplified dataset would look like this

A,"B","C","D"
comp_a,"tree","house","door"
comp_b,"truck","red","blue"

I need the data to be stored as separate columns without the quotation marks like this:

A B C D
comp_a tree house door
comp_b truck red blue

I tried using

import pandas as pd
df_csv = pd.read(path_to_file,delimiter=',')

which gives me the complete header as a single variable for the last column

A,"B","C","D"
comp_a "tree" "house" "door"
comp_b "truck" "red" "blue"

The closest result to the one i need was by using the following

df_csv = pd.read(path_to_file,delimiter=',',quoting=3)

which correctly recognizes each column, but adds in a bunch of extra double quotes.

"A ""B"" ""C"" ""D"""
"comp_a ""tree"" ""house"" ""door"""
"comp_b ""truck"" ""red"" ""blue"""

Setting quoting to a value from 0 to 2 just reads an entire row as a single column.

Does anyone know how I can remove all quotation marks when reading the .csv file?

Are you sure you are using pd.read? If I have a csv-file containing your three lines with mixed quotation marks, `pd.read_csv` reads it without any problems. — Kay Wittig, Jul 16 '18 at 09:53
To avoid any miscommunication, the first row (A,"B","C","D") is the header of my datafile and these are the values i want to read in as the column names. — Gerrit, Jul 16 '18 at 10:17
That is how I understood it and how pd.read_csv should read it with the default parameters. Which pandas version are you using? Can you post the return value from `pd.__version__`? — Kay Wittig, Jul 16 '18 at 10:41

Federico Gentile · Accepted Answer · 2018-07-16T11:20:33.140

12

Just load the data with pd.read_csv() and then use .replace('"','', regex=True)

In one line it would be:

df = pd.read_csv(filename, sep=',').replace('"','', regex=True)

To set the columns names:

df.columns = df.iloc[0]

And drop row 0:

df = df.drop(index=0).reset_index(drop=True)

edited Jul 16 '18 at 11:20

answered Jul 16 '18 at 09:57

Federico Gentile

5,650
10
47
102

Do you think `df = pd.read_csv(filename, sep=',').replace('"','', regex=True)` ? – jezrael Jul 16 '18 at 09:58
yeah i had tested on a series initially – Federico Gentile Jul 16 '18 at 10:03
When i try this, it removes the double quotes from the data correctly, so thank you for that. However, as the A,"B","C","D" row is the header of my datafile and thus I would like to use that as the column names, it still gives me just a single column which is named A,"B","C","D" instead of a single column named A, another named B etc. Sorry if I was not clear about the header part in my original question – Gerrit Jul 16 '18 at 10:19
@Gerrit ok now I see... then you just rename the columns and then drop the initial row and reset the index – Federico Gentile Jul 16 '18 at 14:31

Nihal · Answer 2 · 2018-07-16T10:01:14.917

2

you can replace " after read_csv and save that file again using df_csv.to_csv('fname')

df_csv.apply(lambda x:x.str.replace('"', ""))

edited Jul 16 '18 at 10:01

answered Jul 16 '18 at 09:53

Nihal

5,262
7
23
41

Kay Wittig · Answer 3 · 2018-07-16T10:04:08.477

Consider your data in a file data.csv like

$> more data.csv 
A,"B","C","D"
comp_a,"tree","house","door"
comp_b,"truck","red","blue"

Perhaps a newer pandas version would solve your problem from itself, e.g. at pd.__version__ = '0.23.1'

In [1]: import pandas as pd

In [2]: pd.read_csv('data.csv')
Out[2]: 
        A      B      C     D
0  comp_a   tree  house  door
1  comp_b  truck    red  blue

Otherwise apply a replace on the read-out

pd.read_csv('data.csv').replace('"', '')

pandas data with double quote

3 Answers3

Linked