-1

I have data to read in .csv file which looks like:

col1,col2,col3,col4,col5
"val1,val2,""{'key1': 'value1', 'key2': 'value2'}"",val4,val5"
"va11,val12,""{'key11': 'value11', 'key12': 'value12'}"",val14,val15"

I've tried import this file via pandas in many ways but always get an error. Is it easy way to do this with pandas?

  • 3
    That's not a valid CSV - or more precisely, those outer quotes make it a single column CSV. Is there any chance you can fix the thing that writes it? If all of the lines are written this way, you could scrub the lines by stripping the outer quotes on both sides. – tdelaney Jul 30 '22 at 23:41
  • That is not a good close. The linked answer may be the next step for OP, but it is not the question asked. Lets reopen so we can give a proper answer. – tdelaney Jul 31 '22 at 00:07
  • It would help to show one attempt to read via pandas and the error you got. i can read it and get a dataframe, its just not quite the right dataframe. – tdelaney Jul 31 '22 at 00:11
  • @tdelaney Actually my orignal file has more columns and some of them are full date format (YYYY-MM-DD HH:MM:SS). I've tried `pd.read_csv("file.csv", sep = ',', header = 0)` - then all goes to first column. Others columns has only NAN values. I try also specify parametr `quotechar` and `lineterminator` but I got error there is more columns than expected. – Maciess Jul 31 '22 at 00:48
  • Even with the outer quotes gone, there are still the double double-quotes around the dict (not calling it JSON because of single quotes). From the trivial example, just deleting all double-quotes would be fine. OP, can you provide some real data to help us get a sense for just how much the misquoting will affect you? – Zach Young Jul 31 '22 at 01:53
  • The source of the problem is in whatever created this CSV in the first place. It looks like a line that would have been perfectly fine as a CSV row was put through a CSV writer which then added quotes to escape the commas an quotes already in the line. The solution is to fix the writer if you can. – tdelaney Jul 31 '22 at 15:25

1 Answers1

0

These rows look like valid CSV rows that were then put through a CSV writer again. That second pass turned the row into a single column by adding quotes to escape the commas and quotes in the already-CSV'd file. You could reverse that process to load the CSV. Or fix the writer which is the real source of the bug.

import csv
import pandas as pd
import io

unmangled = io.StringIO()
with open("test.csv", newline="") as infile:
    # header is unmanagled, so just write it
    unmangled.write(next(infile))
    # read the CSV - the first column is a CSV encoded CSV row
    unmangled.writelines(row[0] + "\n" for row in csv.reader(infile))

# rewind and read the unmangled CSV
unmangled.seek(0)
df = pd.read_csv(unmangled)
print(df)

Output

   col1   col2                                      col3   col4   col5
0  val1   val2      {'key1': 'value1', 'key2': 'value2'}   val4   val5
1  va11  val12  {'key11': 'value11', 'key12': 'value12'}  val14  val15
tdelaney
  • 73,364
  • 6
  • 83
  • 116