Extracting superset list from pandas dataframe

Question

In the following DataFrame, I have a collection of year and month values stored as tuples in a list:

state
alabama           [(2017.0, 10.0), (2017.0, 11.0), (2017.0, 12.0), (2018.0, 1.0)]
arkansas          [(2017.0, 10.0), (2017.0, 11.0), (2017.0, 12.0)]
colorado          [(2017.0, 9.0), (2017.0, 10.0), (2017.0, 11.0)]

How can I extract a superset list of year and month combinations? In this case, the solution would be:

[(2017.0, 9.0), (2017.0, 10.0), (2017.0, 11.0), (2017.0, 12.0), (2018.0, 1.0)]

I could potentially do it using a for loop but that would be slow, anything more pythonic?

Here is what I tried:

for row in df:
    if all(y in row for x, y in df):
        tmp = row

but I get this error:

ValueError: too many values to unpack (expected 2)

With all due respect, you seem to be outsourcing each step of your data-cleaning operation to Stack Overflow. It doesn't reflect well on the amount of effort you put into each question. — miradulo, Dec 04 '17 at 20:49
@miradulo, this is a tricky bit, so asked. but you do have a point — user308827, Dec 04 '17 at 20:51
That doesn't look like much of an effort :P When you've cornered yourself into having lists of tuples inside your DataFrame like this, your solution likely won't improve much over a basic Python level approach. Something like `set(itertools.chain.from_iterable(df.state.values))` maybe. — miradulo, Dec 04 '17 at 21:01
I guess it would be easier to do `df[['Year','Month']].sort_values(['Year','Month']).drop_duplicates().values.tolist()` applying it to the DF from your previous question... — MaxU - stand with Ukraine, Dec 04 '17 at 21:04
@user308827 Well then you'll have to add your DataFrame from the previous question, lol. In which case the answer is "here's what you should have done two questions ago", which kinda drives home my point about how you're asking questions. — miradulo, Dec 04 '17 at 21:07
"I could potentially do it using a for loop but that would be slow, anything more pythonic?" You're forced to be slow: you have a data-frame containing list-of-tuples. The solution `miradulo` gave is probably going to be as fast as anything with `pandas`. — juanpa.arrivillaga, Dec 04 '17 at 21:07
@miradulo, I do agree with your point, I should have planned this better — user308827, Dec 04 '17 at 21:18
@user308827 No worries, just keep it in mind for next time :) — miradulo, Dec 04 '17 at 21:18
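
For reference, the `itertools.chain` idea from the comments can be sketched as follows. This is only a minimal sketch: it assumes the data shown in the question is a Series indexed by state (the name `s` and the way it is rebuilt here are hypothetical, since the question does not show how the object was created), and it adds a `sorted` call so the result comes back in chronological order.

```python
from itertools import chain

import pandas as pd

# Hypothetical reconstruction of the data shown in the question:
# a Series indexed by state, each value a list of (year, month) tuples.
s = pd.Series(
    {
        "alabama": [(2017.0, 10.0), (2017.0, 11.0), (2017.0, 12.0), (2018.0, 1.0)],
        "arkansas": [(2017.0, 10.0), (2017.0, 11.0), (2017.0, 12.0)],
        "colorado": [(2017.0, 9.0), (2017.0, 10.0), (2017.0, 11.0)],
    }
)
s.index.name = "state"

# Flatten every list into one stream of tuples, deduplicate with a set,
# then sort; tuples compare by year first and month second.
superset = sorted(set(chain.from_iterable(s)))
print(superset)
```

If the lists live in a DataFrame column instead, passing that column (for example `df["some_column"]`, a placeholder name) to `chain.from_iterable` should work the same way.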

score 1 · Accepted Answer · answered Dec 04 '17 at 21:07

Using a sample DF from your previous question:

In [109]: df[['Year','Month']].sort_values(['Year','Month']).drop_duplicates().values.tolist()
Out[109]:
[[2017.0, 10.0],
 [2017.0, 11.0],
 [2017.0, 12.0],
 [2018.0, 1.0],
 [2018.0, 2.0],
 [2018.0, 3.0],
 [2018.0, 4.0],
 [2018.0, 5.0],
 [2018.0, 6.0],
 [2018.0, 7.0]]
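
Since the DataFrame from the previous question is not reproduced here, the following is a self-contained sketch of the same approach on made-up long-format data. The `Year` and `Month` column names come from the answer above; the `state` column and the row values are assumptions chosen to mirror the example in this question. It uses `itertuples` rather than `.values.tolist()` so the result is a list of tuples, matching the output the question asks for.

```python
import pandas as pd

# Hypothetical long-format frame: one row per (state, year, month) observation.
df = pd.DataFrame(
    {
        "state": ["alabama"] * 4 + ["arkansas"] * 3 + ["colorado"] * 3,
        "Year": [2017.0, 2017.0, 2017.0, 2018.0, 2017.0,
                 2017.0, 2017.0, 2017.0, 2017.0, 2017.0],
        "Month": [10.0, 11.0, 12.0, 1.0, 10.0, 11.0, 12.0, 9.0, 10.0, 11.0],
    }
)

# Keep only the two key columns, drop repeated pairs, sort chronologically.
unique_pairs = (
    df[["Year", "Month"]]
    .drop_duplicates()
    .sort_values(["Year", "Month"])
)

# name=None makes itertuples yield plain tuples instead of namedtuples.
superset = list(unique_pairs.itertuples(index=False, name=None))
print(superset)
```

Working from the long-format data avoids building lists of tuples inside the DataFrame in the first place, which is the point raised in the comments.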
