1

I would like to import csv files with pandas. Normally my data is given in the form:

a,b,c,d
a1,b1,c1,d1
a2,b2,c2,d2

where a,b,c,d is the header. I can easily use the pandas.read_csv here. However, now I have data stored like this:

"a;b;c;d"
"a1;\"b1\";\"c1\";\"d1\""
"a2;\"b2\";\"c2\";\"d2\""

How can I clean this up in the most efficient way? How can I remove the string around the entire row so that it can detect the columns? And then how to remove all the "?

Thanks a lot for any help!!

I am not sure what to do. enter image description here

Luuk
  • 12,245
  • 5
  • 22
  • 33
Anna
  • 81
  • 5
  • 1
    What did you do to get your file in this format? Better fix it upstream – mozway Apr 04 '23 at 15:07
  • @mozway unfortunately I can't. It's an old file generated in the past – Anna Apr 04 '23 at 15:09
  • Well if it's a legacy file, then at least you can fix it once for all. You should do this with whatever tool works for you, separately from and in advance of trying to read it into Pandas. – John Bollinger Apr 04 '23 at 15:16
  • Are there any quotes (`"`) or escaped quotes (`\"`) that you need to keep? Do you need to change the delimiters from semicolons (`;`) to commas (`,`)? – John Bollinger Apr 04 '23 at 15:18
  • @JohnBollinger no I don't want to keep the escaped quotes. I don't neceassarily need to change the ; to , as I could use ; as the delimiter, I think, when using pandas.read_csv – Anna Apr 04 '23 at 15:29
  • This question seems to be a duplicate of: [Ignore character while importing with pandas](https://stackoverflow.com/questions/23810367/ignore-character-while-importing-with-pandas) – Luuk Apr 04 '23 at 15:35

3 Answers3

2

Here is one option with read_csv (and I'm sure we can make it better) :

df = (
        pd.read_csv("input.csv", sep=r";|;\\?", engine="python")
            .pipe(lambda df_: df_.set_axis(df_.columns.str.strip('"'), axis=1))
            .replace(r'[\\"]', "", regex=True)

     )

Output :

​
print(df)
​
    a   b   c   d
0  a1  b1  c1  d1
1  a2  b2  c2  d2
Timeless
  • 22,580
  • 4
  • 12
  • 30
0

You can use sed to breakdown the file into your chosen format.

For a simple example matching your issue using sed:

$ cat file 
"a1a1;"a1a1";"a1a1";"a1a1""
$ cat file | sed 's/"//g'
a1a1;a1a1;a1a1;a1a1

sed 's/"//g' This will replace all " chars with nothing, the g at the end tells sed to do this for every " char and not just the first found.

I see you edited the question, here is an update to the new text output:

$ cat file
"a1;\"b1\";\"c1\";\"d1\""
"a2;\"b2\";\"c2\";\"d2\""
$ cat file | sed 's/"//g' | sed 's|\\||g' 
a1;b1;c1;d1
a2;b2;c2;d2
  • Or `cat file | sed -e 's/"//g' -e 's|\\||g'`, no need to pipe twice. – Luuk Apr 04 '23 at 15:31
  • @crabpeople Thanks a lot for your fast response!! I am not exactly sure how to implement your suggestion. Additionally, is there a way to do this in python directly and not via the terminal? Because I have to do this for approx 150 datasets and in python I am looping over them. – Anna Apr 04 '23 at 15:33
  • The latter is potentially problematic in that it will remove backslashes that are not used to escape quotation marks, if any are present. – John Bollinger Apr 04 '23 at 15:33
  • You can loop in bash – juanpa.arrivillaga Apr 04 '23 at 16:27
0

When you need/want to do it in Python:

Just removing the leading and ending quotes:



file1 = open('abcd.csv',"r")
file2 = open('abcd-new.csv',"w")
lines = file1.readlines()

for line in lines:
    if (line.startswith("\"") and line.endswith("\"")):
         line = line[1:len(line)-1] 
    print(line)
    file2.write(line)
file2.close()

and when you also need to replace the \":



file1 = open('abcd.csv',"r")
file2 = open('abcd-new.csv',"w")
lines = file1.readlines()

for line in lines:
    if (line.startswith("\"") and line.endswith("\"")):
         line = line[1:len(line)-1] 
    line = line.replace("\"","")
    line = line.replace("\\","")
    print(line)
    file2.write(line)
file2.close()
Luuk
  • 12,245
  • 5
  • 22
  • 33