
I'm trying to convert a pandas column from string to set so I can perform set operations (-) and methods (.union) between two DataFrames on their set_array columns. The data is imported from two csv files, each with a set_array column. However, once I run pd.read_csv, the column's values become str, which prevents me from doing set operations and methods.

csv1:

        set_array
0       {985,784}
1       {887}
2       set()
3       {123,469,789}
4       set()

After loading csv1 into a DataFrame using df = pd.read_csv(csv1), the values are str, and when I access the first element using df['set_array'].values[0], I get the following:

'{985, 784}'

However, if I create my own DataFrame with a set column using df1 = pd.DataFrame({'set_array': [{985, 784}, {887}, set(), {123, 469, 789}, set()]}) and access the first element again using df1['set_array'].values[0], I get the following (desired output):

{985, 784} <-without the ''

Here is what I tried so far:

1) df.replace('set()', '') <-removes the set() portion from df
2) df['set_array'] = df['set_array'].apply(set) <-does not work
3) df['set_array'] = df['set_array'].apply(lambda x: {x}) <-does not work
4) df['set_array'].astype(int) <-convert to int first then convert to set, does not work
5) df['set_array'].astype(set) <-does not work
6) df['set_array'].to_numpy() <-convert to array, does not work

I'm also thinking of converting the column to set at the pd.read_csv stage as a potential solution.

Is there any way to load a csv using pandas and keep the set data type, or simply convert the column from str to set so it looks like the desired output above?

Thanks!!

Mick
  • `df.set_array.apply(eval)` would evaluate the column as python code and convert it directly to set types. It's pretty clumsy, and using eval is generally bad practice (https://stackoverflow.com/questions/1832940/why-is-using-eval-a-bad-practice), but it would be the shortest route to the outcome you desire, assuming you're pretty sure about the input data and this is one-off code, not for production. – fordy Oct 20 '20 at 04:17
  • Thanks @fordy! I did see this as a solution, and it did resolve my issue. However, I do plan to use this in production; will this be an issue since I'm only using it on number arrays? – Mick Oct 20 '20 at 04:21
  • I really recommend not saving your csv file with sets in columns. If you want to use data types other than string, int or bool, you can save your files using `pickle`, which preserves the data types (like lists and sets); see the sketch below. – Cainã Max Couto-Silva Oct 20 '20 at 04:29
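
As a quick illustration of the pickle route from the comment above (a sketch, assuming a DataFrame built with real set objects; to_pickle and read_pickle are standard pandas I/O):

import pandas as pd

df1 = pd.DataFrame({'set_array': [{985, 784}, {887}, set(), {123, 469, 789}, set()]})
df1.to_pickle('sets.pkl')                  # objects are serialized as-is, not stringified
df2 = pd.read_pickle('sets.pkl')
print(type(df2['set_array'].values[0]))    # <class 'set'>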

3 Answers


I agree with Cainã that dealing with the root cause in the input data would be the best approach here. But if that's not possible, then something like this would be a lot more predictable than using eval if this is for some kind of production environment:

import re

def parse_set_string(s):
    if s == 'set()':
        return None  # or return set() if you prefer
    else:
        # keep only digits and commas, e.g. '{985,784}' -> '985,784'
        string_nums_only = re.sub('[^0-9,]', '', s)
        split_nums = string_nums_only.split(',')
        return set(map(int, split_nums))


df.set_array.apply(parse_set_string)
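
For reference, a usage sketch (assuming the question's file is saved as csv1.csv and the column is named set_array as shown):

import pandas as pd

df = pd.read_csv('csv1.csv')                                # set_array loads as plain strings
df['set_array'] = df['set_array'].apply(parse_set_string)   # now real sets (or None for set())
df['set_array'].values[0]                                   # a real set now, not a string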
fordy

We've seen this problem before when columns originally contained lists or numpy arrays. csv is a 2d format, rows and columns, so to_csv can only save these embedded objects as strings. What does the file look like?

read_csv by default just loads the strings. To confuse things further, the pandas display does not quote strings, so the str of a set looks the same as the set itself.

With lists, it's enough to do an eval (or ast.literal_eval). With an ndarray, the string has to be edited first.

Make a dataframe and fill it with some objects:

In [107]: df = pandas.DataFrame([None,None,None])
In [108]: df
Out[108]: 
      0
0  None
1  None
2  None
In [109]: df[0][0]
In [110]: df[0][0]=[1,2,3]
In [111]: df[0][1]=np.array([1,2,3])
In [112]: df[0][2]={1,2,3}
In [113]: df
Out[113]: 
           0
0  [1, 2, 3]
1  [1, 2, 3]
2  {1, 2, 3}

The numpy equivalent:

In [114]: df.to_numpy()
Out[114]: 
array([[list([1, 2, 3])],
       [array([1, 2, 3])],
       [{1, 2, 3}]], dtype=object)

Write it to a file:

In [115]: df.to_csv('test.pd')
In [116]: cat test.pd
,0
0,"[1, 2, 3]"
1,[1 2 3]
2,"{1, 2, 3}"

Read it

In [117]: df1 = pandas.read_csv('test.pd')
In [118]: df1
Out[118]: 
   Unnamed: 0          0
0           0  [1, 2, 3]
1           1    [1 2 3]
2           2  {1, 2, 3}

Ignoring the index column that I should have suppressed, it looks a lot like the original df. But it contains strings, not lists, arrays, or sets.

In [119]: df1.to_numpy()
Out[119]: 
array([[0, '[1, 2, 3]'],
       [1, '[1 2 3]'],
       [2, '{1, 2, 3}']], dtype=object)

Changing the frame to contain sets of differing sizes:

In [120]: df[0][1]=set()
In [122]: df[0][0]=set([1])
In [123]: df
Out[123]: 
           0
0        {1}
1         {}
2  {1, 2, 3}
In [124]: df.to_csv('test.pd')
In [125]: cat test.pd
,0
0,{1}
1,set()
2,"{1, 2, 3}"

In [136]: df2 =pandas.read_csv('test.pd',index_col=0)
In [137]: df2
Out[137]: 
           0
0        {1}
1      set()
2  {1, 2, 3}

Looks like eval can convert the empty set as well as the others:

In [138]: df3  =df2['0'].apply(eval)
In [139]: df3
Out[139]: 
0          {1}
1           {}
2    {1, 2, 3}
Name: 0, dtype: object
In [140]: df2.to_numpy()
Out[140]: 
array([['{1}'],
       ['set()'],
       ['{1, 2, 3}']], dtype=object)
In [141]: df3.to_numpy()
Out[141]: array([{1}, set(), {1, 2, 3}], dtype=object)
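
Tying it together: to get objects back for all three rows of df1 (list, array and set strings), each representation needs its own handling. A rough sketch, keyed to the exact strings in this file:

import ast
import numpy as np

def restore(cell):
    # 'set()' and the comma-less str of an ndarray ('[1 2 3]') are not valid
    # Python literals, so they need special handling before ast.literal_eval
    if cell == 'set()':
        return set()
    if cell.startswith('[') and ',' not in cell:
        return np.array(cell[1:-1].split(), dtype=int)   # '[1 2 3]' -> array([1, 2, 3])
    return ast.literal_eval(cell)                        # '[1, 2, 3]' -> list, '{1, 2, 3}' -> set

df1['0'] = df1['0'].apply(restore)

After that, df1['0'] holds real list, ndarray and set objects again, like the original df.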
hpaulj
  • Thanks for the detailed explanation. The part about pandas not displaying quotes around strings in a DataFrame is clear to me after some testing on my end. Your answer mentions that ```eval``` is enough with lists, but what about ```sets``` in my case, where I have a set column? – Mick Oct 20 '20 at 05:05

The problem with your DataFrame is that set_array contains the text representation of both:

  • set literals, e.g. {985,784},
  • Python code, i.e. set(), since an empty set has no literal form.

To cope with this case:

  1. import ast.

  2. Define the following conversion function:

    def mySetConv(txt):
        return set() if txt == 'set()' else ast.literal_eval(txt)
    
  3. Apply it:

    df.set_array = df.set_array.apply(mySetConv)
    

To check the result, you can run:

for it in df.set_array:
    print(it, type(it))

getting:

{784, 985} <class 'set'>
{887} <class 'set'>
set() <class 'set'>
{789, 123, 469} <class 'set'>
set() <class 'set'>

If you had in your source file {} instead of set(), you could run:

df.set_array = df.set_array.apply(ast.literal_eval)

Just a single line of code.
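
If you prefer to convert at the pd.read_csv stage, as considered in the question, the same kind of function can be passed through the converters argument (a sketch, assuming the file is named csv1.csv):

import pandas as pd

df = pd.read_csv('csv1.csv', converters={'set_array': mySetConv})
# the set_array column now holds set objects straight after loading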

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
  • Thanks @Valdi! Quick question: how does the earlier answer with ```.apply(eval)``` differ from your solution using ```.apply(ast.literal_eval)```, as they both solve the ```set()``` and the ```{}``` issue? – Mick Oct 20 '20 at 06:09
  • "Ordinary" *eval* is less safe. It allows for *code injection*, so it is safer to use *ast.literal_eval*. Even one of comments to your post contains *using eval is generally bad practice*. – Valdi_Bo Oct 20 '20 at 10:18