
I have a pandas DataFrame where each cell in a column holds an array of items.

EX: Observation 1 has column Items with values ['Baseball', 'Glove', 'Snack']

When I use .unique() on the column, each cell is compared by the whole array's value, not by the individual values inside the array.

How can I iterate through the array in each cell to determine the true number of unique items in the column? Thanks

  Items
0 ['Baseball', 'Hockey Stick', 'Mit']
1 ['Mit', 'Tennis Racket']
2 ['Baseball', 'Helmet']

These all return as unique values; I would like to get the unique count for each value in each list.

Garrett London
  • "I have a pandas DataFrame where each cell in a column is a 2d array of items." then you almost certainly shouldn't be using pandas. Store only scalar values in cells. Pandas is just not geared for this, use numpy if possible or just go back to base Python and drop all the added complexity. – roganjosh Mar 15 '19 at 20:02
  • This is relevant: https://stackoverflow.com/questions/30565759/get-unique-values-in-list-of-lists-in-python. Just replace the list with `df.Items` – ALollz Mar 15 '19 at 20:03
  • Yes, I understand to only store scalar values in cells, however this is a homework problem. Not real world case. – Garrett London Mar 15 '19 at 20:03
  • Your homework _requires_ you to use pandas with numpy arrays in each cell? That doesn't make sense. What I'm saying is that if you have come to this approach and it's not a requirement, you will want to rethink the approach. – roganjosh Mar 15 '19 at 20:05
  • The requirement is to find unique values however possible. The data was given with an array in each column cell; this is all I know. – Garrett London Mar 15 '19 at 20:09
  • what exactly do you want? what i understand is this: `from collections import OrderedDict` and `list(OrderedDict.fromkeys(list(itertools.chain.from_iterable(df.Items))).keys())` – anky Mar 15 '19 at 20:09
  • Ok, so this data should never have been put in a dataframe. As a data structure, it just doesn't fit the problem. Take a step back and give the raw data – roganjosh Mar 15 '19 at 20:12
  • Can you add your expected output? I'm still confused about what you need here, and whether you need counts or unique items, or both – ALollz Mar 15 '19 at 20:18
  • I figured it out, I just used a double for loop to iterate through. The one liners sometimes are too complex – Garrett London Mar 15 '19 at 20:26
  • Great that you figured it out, that's pretty valuable! – tobsecret Mar 15 '19 at 20:43
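For reference, the "double for loop" the asker mentions in the comments might look like the following sketch (the DataFrame and column name are assumptions; the original code was not posted):

```python
import pandas as pd

# Sample data matching the question; names are assumptions.
df = pd.DataFrame({'Items': [['Baseball', 'Hockey Stick', 'Mit'],
                             ['Mit', 'Tennis Racket'],
                             ['Baseball', 'Helmet']]})

unique_items = set()
for row in df['Items']:   # outer loop: each cell (a list)
    for item in row:      # inner loop: each item in that list
        unique_items.add(item)

print(len(unique_items))  # 5 distinct items in the column
```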

2 Answers


You can use np.unique and np.concatenate on the column of interest. I have made an example below:

import pandas as pd
import numpy as np

df = pd.DataFrame({'fruits': (np.array(['banana', 'apple']), np.array(['cherry', 'apple']))})
#            fruits
#0  [banana, apple]
#1  [cherry, apple]
np.concatenate(df.fruits.values) # .values accesses the numpy array representation of the column
#array(['banana', 'apple', 'cherry', 'apple'],
#      dtype='<U6')
np.unique(np.concatenate(df.fruits.values)) # unique items
#array(['apple', 'banana', 'cherry'],
#      dtype='<U6')
np.unique(np.concatenate(df.fruits.values), return_counts=True) # counts
#(array(['apple', 'banana', 'cherry'],
#      dtype='<U6'), array([2, 1, 1]))
subset = df.fruits.dropna() # getting rid of NaNs
subset = subset.loc[subset.map(len) != 0] # getting rid of zero-length arrays
#0    [banana, apple]
#1    [cherry, apple]
#Name: fruits, dtype: object
np.unique(np.concatenate(subset.values), return_counts=True) # this works as desired
#(array(['apple', 'banana', 'cherry'],
#      dtype='<U6'), array([2, 1, 1]))
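If a labeled result is preferred, the (values, counts) pair that np.unique returns can be zipped into a pandas Series (a small add-on to the approach above, not part of the original answer):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'fruits': (np.array(['banana', 'apple']), np.array(['cherry', 'apple']))})

values, counts = np.unique(np.concatenate(df.fruits.values), return_counts=True)
counts_series = pd.Series(counts, index=values)  # maps each item to its count
print(counts_series)  # apple: 2, banana: 1, cherry: 1
```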
tobsecret

I would use the chain function from itertools together with a set to solve the problem as follows.

# you have a DataFrame called data with the column 'items'

from itertools import chain
set_of_items = set(chain.from_iterable(data['items']))

set_of_items is what you want. Note that the column is accessed as data['items'] rather than data.items, because items is also the name of a DataFrame method; and Series.unique() is skipped here, since the cells hold lists, which are unhashable, so calling .unique() on them would raise a TypeError.
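If the per-item counts are needed as well, collections.Counter pairs naturally with chain (an extension of this answer, using sample data shaped like the question's):

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data matching the question; the DataFrame name is an assumption.
data = pd.DataFrame({'items': [['Baseball', 'Hockey Stick', 'Mit'],
                               ['Mit', 'Tennis Racket'],
                               ['Baseball', 'Helmet']]})

# Counter tallies every item across all the lists in the column.
item_counts = Counter(chain.from_iterable(data['items']))
print(item_counts['Baseball'])  # 2
```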

Samuel Nde