1

I've read all related topics - like this, this and this - but couldn't get a solution to work.

I have an input csv file like this:

ItemId,Content                                                      
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}

I've tried several different approaches but couldn't get it to work. I want to read this csv file into a Dataframe like this:

ItemId    Content
--------  -------------------------------------------------------------------------------
i0000008  {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}

With following code (Python 3.9):

df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')

As far as I understand, commas inside dictionary column and commas inside quotation marks are being treated as regular separators, so it raises following error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6

Is it possible to produce desired result? Thanks.

2 Answers2

2

The problem is that the comma's in the Content column are interpreted as separators. You can solve this by using pd.read_fwf to manually set the number of characters on which to split:

df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])  

Result:

ItemId Content
0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
RJ Adriaansen
  • 9,131
  • 2
  • 12
  • 26
  • This doesn't solve the problem, and gives only the first key,value in the dictionary for each row – ThePyGuy Jul 22 '21 at 19:32
  • @ThePyGuy is correct. It didn't produce the desired results, only first key,value was read into Content column. – thatOldITGuy Jul 22 '21 at 19:48
  • @thatOldITGuy You're right. I have updated the answer with a much simpler solution. – RJ Adriaansen Jul 22 '21 at 20:26
  • Good solution above. After this the next step for the OP is to follow the [steps to expand the JSON](https://stackoverflow.com/questions/50054208/pandas-expand-column-containing-json-encoded-array-of-observations-into-rows) and the df is good. – MDR Jul 22 '21 at 20:42
  • @RJAdriaansen it is a good solution, but the real CSV has a much larger (and variable length) JSON column, so I guess the (9,100) column spec would have to be guessed, right? – thatOldITGuy Jul 23 '21 at 12:57
  • @thatOldITGuy it will work fine as long as the number of characters in the first column is consistent, which seems to be the case. You can change 100 to a random number that is larger than the number of characters in the longest row – RJ Adriaansen Jul 23 '21 at 16:31
0

I don't think you'll be able to read it normally with pandas because it has the delimiter used multiple times for a single value; however, reading it with python and doing some processing, you should be able to convert it to pandas dataframe:

def splitValues(x):
    index = x.find(',')
    return x[:index], x[index+1:].strip()

import pandas as pd
data = open('file.csv')
columns = next(data)
columns = columns.strip().split(',')
df = pd.DataFrame(columns=columns, data=(splitValues(row) for row in data))

OUTPUT:

     ItemId                                                                          Content
0  i0000008   {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
ThePyGuy
  • 17,779
  • 5
  • 18
  • 45