3

Question: How to 'pd.read_csv' so that the values in a given column are of type list (a list in each row of a column)?

When creating a DataFrame (from a dict, see below), individual values are of type list. The problem: After writing the DataFrame to a file and reading from the file back to a DataFrame, I get a string instead of a list.

creating the DataFrame
import pandas as pd
dict2df = {"euNOG": ["ENOG410IF52", "KOG2956", "KOG1997"], 
           "neg": [[58], [1332, 753, 716, 782], [187]], 
           "pos": [[96], [659, 661, 705, 1228], [1414]]}
df = pd.DataFrame(dict2df)

value is a list

type(df.loc[0, 'neg']) == list # --> True
type(df.loc[0, 'neg']) == str # --> False
df.loc[1, 'neg'][-1] == 782 # --> True
write to file
df.to_csv('DataFrame.txt', sep='\t', header=True, index=False)
read from file
df = pd.read_csv('DataFrame.txt', sep='\t')

value is a string not a list

type(df.loc[0, 'neg']) == list # --> False
type(df.loc[0, 'neg']) == str # --> True
df.loc[1, 'neg'][-1] == 782 # --> False

Of course, it's possible to convert between the two data types, but it's computationally expensive and needs extra work (see below)

def convert_StringList2ListOfInt(string2convert):
    return [int(ele) for ele in string2convert[1:-1].split(',')]

def DataFrame_StringOfInts2ListOfInts(df, cols2convert_list):
    for column in cols2convert_list:
        column_temp = column + "_temp"
        df[column_temp] = df[column].apply(convert_StringList2ListOfInt, 1)
        df[column] = df[column_temp]
        df = df.drop(column_temp, axis=1)
    return df
df = DataFrame_StringOfInts2ListOfInts(df, ['neg', 'pos'])

What would be a better (more pythonic) solution? It would be very convenient to iterate over the Integers in the list without having to convert them back and forth. Thank you for your support!!

tryptofame
  • 352
  • 2
  • 7
  • 18

1 Answers1

3

You can use ast.literal_eval() to convert the strings to lists.

A simple example of ast.literal_eval() -

>>> import ast
>>> l = ast.literal_eval('[10,20,30]')
>>> type(l)
<class 'list'>

For your case, you can pass it to Series.apply , so that each element in the series is evaluated (safely). Example -

df = pd.read_csv('DataFrame.txt', sep='\t')
import ast
df['neg_list'] = df['neg'].apply(ast.literal_eval)
df = df.drop('neg',axis=1)
df['pos_list'] = df['pos'].apply(ast.literal_eval)
df = df.drop('pos',axis=1)

Demo -

In [15]: import pandas as pd

In [16]: dict2df = {"euNOG": ["ENOG410IF52", "KOG2956", "KOG1997"],
   ....:            "neg": [[58], [1332, 753, 716, 782], [187]],
   ....:            "pos": [[96], [659, 661, 705, 1228], [1414]]}

In [17]: df = pd.DataFrame(dict2df)

In [18]: df.to_csv('DataFrame.txt', sep='\t', header=True, index=False)

In [19]: newdf = pd.read_csv('DataFrame.txt', sep='\t')

In [20]: newdf['neg']
Out[20]:
0                     [58]
1    [1332, 753, 716, 782]
2                    [187]
Name: neg, dtype: object

In [21]: newdf['neg'][0]
Out[21]: '[58]'

In [22]: import ast

In [23]: newdf['neg_list'] = newdf['neg'].apply(ast.literal_eval)

In [24]: newdf = newdf.drop('neg',axis=1)

In [25]: newdf['pos_list'] = newdf['pos'].apply(ast.literal_eval)

In [26]: newdf = newdf.drop('pos',axis=1)

In [27]: newdf
Out[27]:
         euNOG               neg_list               pos_list
0  ENOG410IF52                   [58]                   [96]
1      KOG2956  [1332, 753, 716, 782]  [659, 661, 705, 1228]
2      KOG1997                  [187]                 [1414]

In [28]: newdf['neg_list'][0]
Out[28]: [58]
Anand S Kumar
  • 88,551
  • 18
  • 188
  • 176
  • thank you! this is more pythonic in the sense that it uses a library instead of my own code. I was hoping to find a solution that would include this conversion when reading the data frame. – tryptofame Nov 02 '15 at 11:14