142

I have an n-by-m Pandas DataFrame df defined as follows. (I know this is not the best way to do it. It makes sense for what I'm trying to do in my actual code, but that would be TMI for this post so just take my word that this approach works in my particular scenario.)

>>> from pandas import DataFrame, Series
>>> df = DataFrame(columns=['col1'])
>>> df.append(Series([None]), ignore_index=True)
>>> df
Empty DataFrame
Columns: [col1]
Index: []

I stored lists in the cells of this DataFrame as follows.

>>> df['col1'][0] = [1.23, 2.34]
>>> df
           col1
0  [1.23, 2.34]

For some reason, the DataFrame stored this list as a string instead of a list.

>>> df['col1'][0]
'[1.23, 2.34]'

I have 2 questions for you.

  1. Why does the DataFrame store a list as a string and is there a way around this behavior?
  2. If not, then is there a Pythonic way to convert this string into a list?

Update

The DataFrame I was using had been saved to and loaded from a CSV file. It was this round trip through the CSV format, rather than the DataFrame itself, that converted the list into a string.
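
For anyone who wants to reproduce the effect, here is a minimal sketch (not from the original post) showing that the list survives in memory but comes back as a string after a round trip through CSV:

import io
import pandas as pd

df = pd.DataFrame({'col1': [[1.23, 2.34]]})
print(type(df['col1'][0]))    # <class 'list'> while in memory

buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

df2 = pd.read_csv(buffer)
print(type(df2['col1'][0]))   # <class 'str'> after the CSV round trip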

Gyan Veda
  • Does this answer your question? [Convert string representation of list to list](https://stackoverflow.com/questions/1894269/convert-string-representation-of-list-to-list) – AMC Feb 20 '20 at 19:49

9 Answers

146

As you pointed out, this can commonly happen when saving and loading pandas DataFrames as .csv files, which is a text format.

In your case this happened because list objects have a string representation, allowing them to be stored as .csv files. Loading the .csv will then yield that string representation.

If you want to store the actual objects, you should use DataFrame.to_pickle() (note: objects must be picklable!).
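
A minimal sketch of that round trip, assuming a hypothetical file name lists.pkl; the pickle file keeps the list objects intact:

import pandas as pd

df = pd.DataFrame({'col1': [[1.23, 2.34]]})

# pickling preserves the actual Python objects stored in each cell
df.to_pickle('lists.pkl')

restored = pd.read_pickle('lists.pkl')
print(type(restored['col1'][0]))  # <class 'list'>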

To answer your second question, you can convert it back with ast.literal_eval:

>>> from ast import literal_eval
>>> literal_eval('[1.23, 2.34]')
[1.23, 2.34]
anon582847382
95

You can use pandas directly:

import pandas as pd
df = pd.read_csv(DF_NAME, converters={'COLUMN_NAME': pd.eval})

This will read that column as its corresponding Python dtype instead of a string.
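
As a quick illustration (the CSV contents below are made up for this example), the converter turns each stored string back into a real list while the file is being read:

import io
import pandas as pd

# hypothetical CSV with a list-valued column stored as strings
csv = io.StringIO('col1\n"[1.23, 2.34]"\n"[3.45, 4.56]"\n')

df = pd.read_csv(csv, converters={'col1': pd.eval})
print(type(df['col1'].iloc[0]))  # list-like values rather than the original strings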

UPDATE:

As rightly pointed out by @ctwardy in the comments, it is wiser to use pd.eval instead of the built-in eval to avoid unintended security consequences. Details: https://realpython.com/python-eval-function/#minimizing-the-security-issues-of-eval

markroxor
  • Just be sure you are using pandas.eval and not builtin eval. Builtin eval will happily delete all your files if the right expression is in your csv file. See https://realpython.com/python-eval-function/#minimizing-the-security-issues-of-eval. – ctwardy Sep 25 '21 at 20:41
  • @markroxor - Update: `ast.literal_eval` was 15x faster, and looking more it seems `pd.eval` can fall back to Python eval. – ctwardy Sep 29 '21 at 15:49
  • It's good - `pd.eval` has a `python` engine but that's still limited syntax. It refuses to parse `__import__` and even `sum`. (`literal_eval` does seem faster for this use case.) – ctwardy Sep 30 '21 at 12:04
  • `ast.literal_eval` has the benefit of being able to convert also more complex objects (e.g. list of dicts) where `pd.eval` fails. – אלימלך שרייבר Apr 23 '23 at 20:13
34
  • Use ast.literal_eval to safely evaluate a string containing a Python literal or container datatype.
    • It's part of the standard library

    • Using python's eval() vs. ast.literal_eval()? explains why literal_eval is safer than using eval.

    • Examples:

      • literal_eval("[1.23, 2.34]") works
      • literal_eval("['KB4523205','KB4519569','KB4503308']") works
        • Other answers mention pd.eval, but its usage is limited; it results in ValueError: NumExpr 2 does not support Unicode as a dtype. for this simple example.
      • literal_eval("[KB4523205, KB4519569, KB4503308]") doesn't work (no quotes around the str values)
  • Convert the column when reading the file, by using the converters parameter of pandas.read_csv.

data in test.csv

col1
"[1.23, 2.34]"
"['KB4523205','KB4519569','KB4503308']"

Convert the column when reading the csv

from ast import literal_eval
import pandas as pd

# convert the column during import
df = pd.read_csv('test.csv', converters={'col1': literal_eval})

# display(df)
                                col1
0                       [1.23, 2.34]
1  [KB4523205, KB4519569, KB4503308]

# check type
print(type(df.iloc[0, 0]))
<class 'list'>

print(type(df.iloc[1, 0]))
<class 'list'>

Convert the column of an existing dataframe

df.col1 = df.col1.apply(literal_eval)

%%timeit

  • pd.eval is 28 times slower than literal_eval
  • Given test.csv with 2,820,511 rows of "[1.23, 2.34]"

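The original answer showed the %timeit output as a screenshot. A rough sketch of how such a comparison could be reproduced (absolute numbers depend on the machine and pandas version):

from timeit import timeit
from ast import literal_eval
import pandas as pd

s = "[1.23, 2.34]"

# parse the same string repeatedly with each approach
print('literal_eval:', timeit(lambda: literal_eval(s), number=10_000))
print('pd.eval:     ', timeit(lambda: pd.eval(s), number=10_000))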

Trenton McKinney
13

I just came across this problem and there is a very simple solution (pandas.eval()). I'm using pandas 0.20.0.

# SETUP
import pandas as pd
import io

csv = io.StringIO(u'''
id  list
A1  [1,2]
A2  [3,4]
A3  [5,6]
''')

df = pd.read_csv(csv, delim_whitespace = True)

# TYPE CHECK <class 'str'>
print(type(df.at[0, 'list']))

# MAIN CONVERSION
df['list'] = pd.eval(df['list'])

# TYPE CHECK <class 'list'>
print(type(df.at[0, 'list']))
elPastor
3

1) There is a way around this behavior: using loc helps here.

>>> import pandas as pd

>>> df = pd.DataFrame(columns=['column1'])
>>> df = df.append(pd.Series(data = {'column1':[None]}), ignore_index = True)

   column1
0  [None]

>>> # Add list to index 0 in column1
>>> df.loc[0,'column1'] = [1.23, 2.34]
>>> print(df.loc[0, 'column1'])
[1.23, 2.34]

2) A Pythonic way to convert this string into a list. (This is probably what you want, since the DataFrame you are using had been saved to and loaded from a CSV file; there are a couple of solutions for this.) This builds on pshep123's answer.

from ast import literal_eval
import io
import pandas as pd

csv = io.StringIO(u'''
id  list
A1  [1,2]
A2  [3,4]
A3  [5,6]
''')
df = pd.read_csv(csv, delim_whitespace = True)

# Output is a string
df.loc[0, 'list']
'[1,2]'

# Convert entire column to a list
df.loc[:,'list'] = df.loc[:,'list'].apply(literal_eval)

# Output is a list
df.loc[0, 'list']
[1, 2]
2

I had the same problem. When storing a DataFrame list column to a CSV file using df.to_csv(), list columns are converted to strings, e.g. "[42, 42, 42]" instead of [42, 42, 42].

Alex's answer is correct: you can use literal_eval to convert the string back to a list. The drawback of this approach is that you need to import an additional library and apply or map the function over your DataFrame. An easier way is to force pandas to read the column as a Python object (dtype):

df["col1"].astype('O')

The O is used for Python objects including lists. More info here. Please note that this method fails if you parse empty list strings: "[]"

Alternatively you can also apply a function to your column (this one is for integers):

def stringToList(string):
    # input format: "[42, 42, 42]" (note the spaces after the commas; this version handles lists of integers)
    string = string[1:-1]  # strip the surrounding brackets
    try:
        if len(string) != 0:
            newList = [int(x) for x in string.split(", ")]
        else:
            newList = []
    except ValueError:
        newList = [-9999]  # sentinel value for unparseable input
    return newList

df["col1"] = df["col1"].apply(stringToList)
Rutger Hofste
0

For reference only: pandas does not convert lists into strings when the DataFrame is built directly from Python objects.

In [29]: data2 = [{'a': [1, 5], 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]                                                                                        

In [30]: df = pd.DataFrame(data2)                                                                                                                           

In [31]: df                                                                                                                                                 
Out[31]: 
        a   b   c
0  [1, 5]   2 NaN
1       5  10  20

In [32]: df['a'][0], type(df['a'][0])                                                                                                                       
Out[32]: ([1, 5], list)

In [33]: pd.__version__
Out[33]: '0.12.0'
namit
0

A simple hack I used is to apply a lambda that slices off the first and last characters (the list brackets in string form) and splits on the commas, followed by another lambda that converts the elements to ints.

df['column1'] = df['column1'].apply(lambda x:x[1:-1].split(',')).apply(lambda x:[int(i) for i in x])
Hassen Morad
0

Adding onto Alex's answer, here is another version that can be used to convert individual items from string to list:

import pandas as pd
from ast import literal_eval

df = pd.read_csv("some_csvfile.csv")

def item_gen(l):
    for i in l:
        yield(i)

for i in item_gen(df["some_column_with_list_item"]):
    print(literal_eval(i))
Abercrombie