import dirty csv file with unwanted characters, strings

Question

I would like to import csv files with pandas. Normally my data is given in the form:

a,b,c,d
a1,b1,c1,d1
a2,b2,c2,d2

where a,b,c,d is the header. I can easily use the pandas.read_csv here. However, now I have data stored like this:

"a;b;c;d"
"a1;\"b1\";\"c1\";\"d1\""
"a2;\"b2\";\"c2\";\"d2\""

How can I clean this up in the most efficient way? How can I remove the string around the entire row so that it can detect the columns? And then how to remove all the "?

Thanks a lot for any help!!

I am not sure what to do. enter image description here

What did you do to get your file in this format? Better fix it upstream — mozway, Apr 04 '23 at 15:07
@mozway unfortunately I can't. It's an old file generated in the past — Anna, Apr 04 '23 at 15:09
Well if it's a legacy file, then at least you can fix it once for all. You should do this with whatever tool works for you, separately from and in advance of trying to read it into Pandas. — John Bollinger, Apr 04 '23 at 15:16
Are there any quotes (`"`) or escaped quotes (`\"`) that you need to keep? Do you need to change the delimiters from semicolons (`;`) to commas (`,`)? — John Bollinger, Apr 04 '23 at 15:18
@JohnBollinger no I don't want to keep the escaped quotes. I don't neceassarily need to change the ; to , as I could use ; as the delimiter, I think, when using pandas.read_csv — Anna, Apr 04 '23 at 15:29
This question seems to be a duplicate of: [Ignore character while importing with pandas](https://stackoverflow.com/questions/23810367/ignore-character-while-importing-with-pandas) — Luuk, Apr 04 '23 at 15:35

score 2 · Accepted Answer · answered Apr 04 '23 at 15:58

2

Here is one option with read_csv (and I'm sure we can make it better) :

df = (
        pd.read_csv("input.csv", sep=r";|;\\?", engine="python")
            .pipe(lambda df_: df_.set_axis(df_.columns.str.strip('"'), axis=1))
            .replace(r'[\\"]', "", regex=True)

     )

Output :


print(df)

    a   b   c   d
0  a1  b1  c1  d1
1  a2  b2  c2  d2

answered Apr 04 '23 at 15:58

Timeless

22,580
4
12
30

`pipe` here seems to be unnecessary – juanpa.arrivillaga Apr 04 '23 at 16:27
I don't think so. Otherwise, you'll get an error. Try it ;) – Timeless Apr 04 '23 at 16:30
1

Ah, got it, missed that you were using `df_.columns` as an argument to `.set_axis` – juanpa.arrivillaga Apr 04 '23 at 17:30

crabpeople · Answer 2 · 2023-04-04T15:22:02.280

0

You can use sed to breakdown the file into your chosen format.

For a simple example matching your issue using sed:

$ cat file 
"a1a1;"a1a1";"a1a1";"a1a1""
$ cat file | sed 's/"//g'
a1a1;a1a1;a1a1;a1a1

sed 's/"//g' This will replace all " chars with nothing, the g at the end tells sed to do this for every " char and not just the first found.

I see you edited the question, here is an update to the new text output:

$ cat file
"a1;\"b1\";\"c1\";\"d1\""
"a2;\"b2\";\"c2\";\"d2\""
$ cat file | sed 's/"//g' | sed 's|\\||g' 
a1;b1;c1;d1
a2;b2;c2;d2

edited Apr 04 '23 at 15:22

answered Apr 04 '23 at 15:15

crabpeople

1
4

Or `cat file | sed -e 's/"//g' -e 's|\\||g'`, no need to pipe twice. – Luuk Apr 04 '23 at 15:31
@crabpeople Thanks a lot for your fast response!! I am not exactly sure how to implement your suggestion. Additionally, is there a way to do this in python directly and not via the terminal? Because I have to do this for approx 150 datasets and in python I am looping over them. – Anna Apr 04 '23 at 15:33
The latter is potentially problematic in that it will remove backslashes that are not used to escape quotation marks, if any are present. – John Bollinger Apr 04 '23 at 15:33
You can loop in bash – juanpa.arrivillaga Apr 04 '23 at 16:27

score 0 · Answer 3 · answered Apr 04 '23 at 15:42

0

When you need/want to do it in Python:

Just removing the leading and ending quotes:



file1 = open('abcd.csv',"r")
file2 = open('abcd-new.csv',"w")
lines = file1.readlines()

for line in lines:
    if (line.startswith("\"") and line.endswith("\"")):
         line = line[1:len(line)-1] 
    print(line)
    file2.write(line)
file2.close()

and when you also need to replace the \":



file1 = open('abcd.csv',"r")
file2 = open('abcd-new.csv',"w")
lines = file1.readlines()

for line in lines:
    if (line.startswith("\"") and line.endswith("\"")):
         line = line[1:len(line)-1] 
    line = line.replace("\"","")
    line = line.replace("\\","")
    print(line)
    file2.write(line)
file2.close()

answered Apr 04 '23 at 15:42

Luuk

12,245
5
22
33

*don't use `readlines`* – juanpa.arrivillaga Apr 04 '23 at 16:09
@juanpa.arrivillaga: Sorry, I am missing a reason not to use _readlines_ – Luuk Apr 04 '23 at 16:10
@juanpa.arrivillaga: OK, found it here: [The short version is: The efficient way to use readlines() is to not use it. Ever.](https://stackoverflow.com/a/17246300/724039) – Luuk Apr 04 '23 at 16:13
Just iterate over the file object directly – juanpa.arrivillaga Apr 04 '23 at 16:26

import dirty csv file with unwanted characters, strings

3 Answers3