1

I am creating a ticket tracking project where I have a pandas dataframe holding the ticket information. I am then storing this dataframe into a csv file. The dataframe will initialize at the start of the program.

One of the column values is a list. When you store the pandas dataframe in a csv file with this line of code: self.ticketDF.to_csv(self.ticketCSVFilePath), it surrounds the list in double quotes. When you then read it back in, it now is interpreted as a string, not a list. In my example, you can see the list with double quotes under the comments column.

Here is my file - tickets.csv:

Ticket ID,Subject,Project,Description,Priority,Comments
PROT-18,testSubject,testProject,testDescription,testPriority,"['comment1', 'comment2', 'comment3']"
PROT-19,testSubject,testProject,testDescription,testPriority,"['comment4', 'comment5', 'comment6']"

I am initializing the pandas dataframe, using these two functions:

def initializeTicketDF(self):
   if path.exists(self.ticketCSVFilePath) and path.getsize(self.ticketCSVFilePath) > 0:
       self.ticketDF = pd.read_csv(self.ticketCSVFilePath)  #reading the csv file into the dataframe
       self.ticketDF.set_index('Ticket ID', inplace=True)   #I am setting the index to the Ticket ID
       self.columnToList("Comments")                        #Calling my function that currently does the 
                                                            #workaround

def columnToList(self, columnName):
   count = 0                                #this represents the current row in the dataframe
   for x in self.ticketDF['Comments']:      #x holds the "Comments" value for every row
       x = x.replace('[', '')               #replace left and right brackets
       x = x.replace(']', '')
       x = re.findall('\'([^\']*)\'', x)    #get a list of all values between single quotes
       self.ticketDF[columnName][count] = x #store the list back into the dataframe
       count += 1 

To work around this issue as shown above, I am replacing each bracket separately and then getting a list of all values between single quotations with this line: x = re.findall('\'([^\']*)\'', x) I am then storing the list back into the dataframe row by row.

I have also tried using csv.DictReader/Writer and it does the same thing.

Is there a way to read the list in the csv without having to do any string modifications? Is there a regular expression I could use to clean up the list's string?

Any thoughts would be greatly appreciated. Thanks!

Jeff Gruenbaum
  • 364
  • 6
  • 21
  • Does this answer your question? [Pandas read csv with regex separator](https://stackoverflow.com/questions/61142670/pandas-read-csv-with-regex-separator) – imbr May 31 '20 at 00:33

1 Answers1

2

You can pass a converted for a column to pd.read_csv():

import pandas as pd
from ast import literal_eval

p = pd.read_csv(path, converters={'Comments':literal_eval})

p['Comments']
# 0    [comment1, comment2, comment3]
# 1    [comment4, comment5, comment6]

p['Comments'][0][1]
# 'comment2'

literal_eval will safely evaluate simple expressions like your list.

Mark
  • 90,562
  • 7
  • 108
  • 148