3

I have the following dataframe:

  address.state    address.town      dates 
0            MI     Dearborn         None
1            CA  Los Angeles         [2014-01-01, 2015-01-01]

How would I get a list of all values for the column, splitting up if there's ever a list value. For example:

>>> df['address.state']
['MI', 'CA'] # length of 2
>>> df['dates']
[None, '2014-01-01', '2015-01-01'] # length of 3

How would I do this in a generalized way if any of the values in the df columns has a list field?

Currently what I'm doing is:

_values = []
for _val in df.iloc[:,col_index]:
    if not isinstance(_val, list):
        _values.append(_val)
    else:
        _values.extend(_val)

>>> _values
['2014-01-01', '2015-01-01', None]

Is there a better way to do this though, perhaps directly in pandas?

  • there are lots of possibility in your expected answer, please cover all the possible edge cases in your question. and what is the expected output for all different cases? – Nihal Dec 29 '18 at 07:07
  • @Nihal see updated question please. –  Dec 29 '18 at 07:09
  • you want only lists, what if the value is only integer not a list?, Why there is `None` in your output?? – Nihal Dec 29 '18 at 07:10
  • @Nihal because it's a non-list value? Basically, I want a list of all values in that column...so any list should be added to that list. –  Dec 29 '18 at 07:11
  • i think your way is right. pandas don't have direction function for that. you have `none`, so numpy `flatten` will not work. – Nihal Dec 29 '18 at 07:41

2 Answers2

0

I guess I originally misread the question:

to get unique values in each column:

df['col'].apply(pd.Series).stack().unique()

for example:

df = pd.DataFrame({'address': ['MI' , 'CA'], 'name':['John', 'Jane'], 'list': [['ab', 'cd'], 'e']})

df
Out[4]: 
  address      list  name
0      MI  [ab, cd]  John
1      CA         e  Jane

df['list'].apply(pd.Series).stack().unique()
Out[5]: array(['ab', 'cd', 'e'], dtype=object)

.apply(pd.Series) transforms the lists into single elements in multiple new columns.

stack re-stacks the multiple columns into just one.

unique() gets the unique values of that column.

MrE
  • 19,584
  • 12
  • 87
  • 105
0

Use .tolist() to generate a list of the column. The only thing then is that you'd get a list in a list. To flatten the list, if you don't have multi-character strings in there:

>>> l = ['a',['b','c']]
>>> l
['a', ['b', 'c']]
>>> flat_list = [item for sublist in l for item in sublist]
>>> flat_list
['a', 'b', 'c']

(flat_list solution taken from How to make a flat list out of list of lists?)

caveat: Note however, that any iterable in the input will be flattened, including multi-character strings:

>>> l = ['one item', ['b', 'c']]
>>> [item for sublist in l for item in sublist]
['o', 'n', 'e', ' ', 'i', 't', 'e', 'm', 'b', 'c']

That's likely not what you want. Check out this discussion for a more robust way to flatten only certain types of iterables.

gosuto
  • 5,422
  • 6
  • 36
  • 57
  • 1
    also this solution won't work if you column names are more than 1 letter long, because if it's not a array, python will deconstruct a string into the letters of the string: if `l = ['abcd',['b','c']]` then it returns `['a', 'b', 'c', 'd', 'b', 'c']` – MrE Dec 30 '18 at 02:07
  • Great point MrE. The answer at https://stackoverflow.com/a/29325518/1701416 will flatten iterables except for strings. – waterproof Dec 30 '18 at 02:29