
I have a pandas DataFrame where each cell in a column holds an array of items.

EX: Observation 1 has column Items with values ['Baseball', 'Glove', 'Snack']

When I use .unique() on the column, each cell is compared by the whole array's value, not by the individual values inside the array.

How can I iterate through the array in each cell to determine the true number of unique items in the column? Thanks

  Items
0 ['Baseball', 'Hockey Stick', 'Mit']
1 ['Mit', 'Tennis Racket']
2 ['Baseball', 'Helmet']

These all return as unique values; I would like to get the unique count for each value in each list.

Garrett London
  • "I have a pandas DataFrame where each cell in a column is a 2d array of items." then you almost certainly shouldn't be using pandas. Store only scalar values in cells. Pandas is just not geared for this, use numpy if possible or just go back to base Python and drop all the added complexity. – roganjosh Mar 15 '19 at 20:02
  • This is relevant: https://stackoverflow.com/questions/30565759/get-unique-values-in-list-of-lists-in-python. Just replace the list with `df.Items` – ALollz Mar 15 '19 at 20:03
  • Yes, I understand to only store scalar values in cells, however this is a homework problem. Not real world case. – Garrett London Mar 15 '19 at 20:03
  • Your homework _requires_ you to use pandas with numpy arrays in each cell? That doesn't make sense. What I'm saying is that if you have come to this approach and it's not a requirement, you will want to rethink the approach. – roganjosh Mar 15 '19 at 20:05
  • The requirement is to find unique values however possible. The data was given with an array in each column cell; this is all I know. – Garrett London Mar 15 '19 at 20:09
  • what exactly do you want? what i understand is this: `from collections import OrderedDict` and `list(OrderedDict.fromkeys(list(itertools.chain.from_iterable(df.Items))).keys())` – anky Mar 15 '19 at 20:09
  • Ok, so this data should never have been put in a dataframe. As a data structure, it just doesn't fit the problem. Take a step back and give the raw data – roganjosh Mar 15 '19 at 20:12
  • Can you add your expected output? I'm still confused about what you need here, and whether you need counts or unique items, or both – ALollz Mar 15 '19 at 20:18
  • I figured it out, I just used a double for loop to iterate through. The one liners sometimes are too complex – Garrett London Mar 15 '19 at 20:26
  • Great that you figured it out, that's pretty valuable! – tobsecret Mar 15 '19 at 20:43
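For reference, the "double for loop" the asker mentions in the comments might look like the following sketch (the DataFrame and column name are assumptions; the original code was not posted):

```python
import pandas as pd

# Sample data matching the question; names are assumptions.
df = pd.DataFrame({'Items': [['Baseball', 'Hockey Stick', 'Mit'],
                             ['Mit', 'Tennis Racket'],
                             ['Baseball', 'Helmet']]})

unique_items = set()
for row in df['Items']:   # outer loop: each cell (a list)
    for item in row:      # inner loop: each item in that list
        unique_items.add(item)

print(len(unique_items))  # 5 distinct items in the column
```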

2 Answers


You can use np.unique and np.concatenate on the column of interest. I have made an example below:

import pandas as pd
import numpy as np

df = pd.DataFrame({'fruits': (np.array(['banana', 'apple']), np.array(['cherry', 'apple']))})
#            fruits
#0  [banana, apple]
#1  [cherry, apple]
np.concatenate(df.fruits.values) # .values accesses the numpy array representation of the column
#array(['banana', 'apple', 'cherry', 'apple'],
#      dtype='<U6')
np.unique(np.concatenate(df.fruits.values)) # unique items
#array(['apple', 'banana', 'cherry'],
#      dtype='<U6')
np.unique(np.concatenate(df.fruits.values), return_counts=True) # counts
#(array(['apple', 'banana', 'cherry'],
#      dtype='<U6'), array([2, 1, 1]))
subset = df.fruits.dropna() # getting rid of NaNs
subset = subset.loc[subset.map(len) != 0] # getting rid of zero-length arrays
#0    [banana, apple]
#1    [cherry, apple]
#Name: fruits, dtype: object
np.unique(np.concatenate(subset.values), return_counts=True) # this works as desired
#(array(['apple', 'banana', 'cherry'],
#      dtype='<U6'), array([2, 1, 1]))
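If a labeled result is preferred, the (values, counts) pair that np.unique returns can be zipped into a pandas Series (a small add-on to the approach above, not part of the original answer):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'fruits': (np.array(['banana', 'apple']), np.array(['cherry', 'apple']))})

values, counts = np.unique(np.concatenate(df.fruits.values), return_counts=True)
counts_series = pd.Series(counts, index=values)  # maps each item to its count
print(counts_series)  # apple: 2, banana: 1, cherry: 1
```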
tobsecret

I would use the chain function from itertools together with a set to solve the problem as follows.

# you have a DataFrame called data with the column 'items'

from itertools import chain
set_of_items = set(chain.from_iterable(data['items']))

set_of_items is what you want. Note that the column is accessed as data['items'] rather than data.items, because items is also the name of a DataFrame method; and Series.unique() is skipped here, since the cells hold lists, which are unhashable, so calling .unique() on them would raise a TypeError.
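If the per-item counts are needed as well, collections.Counter pairs naturally with chain (an extension of this answer, using sample data shaped like the question's):

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data matching the question; the DataFrame name is an assumption.
data = pd.DataFrame({'items': [['Baseball', 'Hockey Stick', 'Mit'],
                               ['Mit', 'Tennis Racket'],
                               ['Baseball', 'Helmet']]})

# Counter tallies every item across all the lists in the column.
item_counts = Counter(chain.from_iterable(data['items']))
print(item_counts['Baseball'])  # 2
```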

Samuel Nde