How to perform various operations on a pandas DataFrame column containing a list of tuples in Python?

Question

I have a pandas DataFrame in which one column, resources, holds a list of tuples in each row. For example, take the following DataFrame:

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "resources": [[(1, 3), (1, 1), (2, 9)],
                                 [(3, 1), (3, 1), (3, 4)],
                                 [(9, 0), (2, 6), (5, 5)]]
                  })

Now I want to add the following columns to my DataFrame:

- A column first containing a list with the unique first elements of the tuples in resources (so basically a set of all the first elements)
- A column second containing a list with the unique second elements of the tuples in resources (so basically a set of all the second elements)
- A column same containing the number of tuples in resources having the same first and second element
- A column different containing the number of tuples in resources having different first and second elements

The desired output columns would look like this:

first: [[1, 2], [3], [9, 2, 5]]
second: [[1, 3, 9], [1, 4], [0, 6, 5]]
same: [1, 0, 1]
different: [2, 3, 2]

What is the fastest way to achieve this? I first thought of using Series.str, but could not find enough functionality there to achieve my goal.

Please repeat [on topic](https://stackoverflow.com/help/on-topic) and [how to ask](https://stackoverflow.com/help/how-to-ask) from the [intro tour](https://stackoverflow.com/tour). “Show me how to solve this coding problem” is not a Stack Overflow issue. We expect you to make an honest attempt, and *then* ask a *specific* question about your algorithm or technique. Stack Overflow is not intended to replace existing documentation and tutorials. — Prune, Apr 22 '21 at 22:56
I mentioned in my answer I tried using Series.str, which did not work. Then I went looking for all functionality there and could not find anything more useful — Peter, Apr 22 '21 at 23:04

score 7 · Accepted Answer · answered Apr 22 '21 at 22:50

# Unique first and second elements of the tuples in each row
df["first"] = df["resources"].apply(lambda x: [*set(i for i, _ in x)])
df["second"] = df["resources"].apply(lambda x: [*set(i for _, i in x)])
# A tuple collapses to a one-element set exactly when both elements are equal
df["same"] = df["resources"].apply(lambda x: sum(len(set(t)) == 1 for t in x))
df["different"] = df["resources"].apply(
    lambda x: sum(len(set(t)) > 1 for t in x)
)

print(df)

Prints:

   id                 resources      first     second  same  different
0   1  [(1, 3), (1, 1), (2, 9)]     [1, 2]  [1, 3, 9]     1          2
1   2  [(3, 1), (3, 1), (3, 4)]        [3]     [1, 4]     0          3
2   3  [(9, 0), (2, 6), (5, 5)]  [9, 2, 5]  [0, 5, 6]     1          2
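
One thing to keep in mind: a set does not guarantee any particular element order, which is why second for the last row prints as [0, 5, 6] while the question lists [0, 6, 5]. If first-seen order matters, a minimal sketch along these lines might help. It relies on dict keys keeping insertion order (Python 3.7+), and the helper name unique_in_order is made up for illustration:

```python
def unique_in_order(values):
    """Hypothetical helper: drop duplicates while keeping first-seen order."""
    # dict keys are unique and, since Python 3.7, keep insertion order
    return list(dict.fromkeys(values))


# Last row of the example data from the question
resources = [(9, 0), (2, 6), (5, 5)]

first = unique_in_order(i for i, _ in resources)
second = unique_in_order(i for _, i in resources)

print(first)   # [9, 2, 5]
print(second)  # [0, 6, 5]
```

With the DataFrame from the question, the same helper could be plugged into apply, for example df["first"] = df["resources"].apply(lambda x: unique_in_order(i for i, _ in x)).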

Andrej Kesely

Thank you very much! What is the purpose of the "*" before set? – Peter Apr 22 '21 at 23:04
[What does the star and doublestar operator mean in a function call?](https://stackoverflow.com/q/2921847/15497888) – Henry Ecker Apr 22 '21 at 23:05
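
For reference, the star in [*set(...)] is iterable unpacking inside a list display (PEP 448, Python 3.5+). It spreads the elements of the set into a new list, so it should have the same effect as calling list() on the set. A small illustration on the first row of the example data, with variable names chosen only for this sketch:

```python
# First row of the example data from the question
resources = [(1, 3), (1, 1), (2, 9)]

# Collect the distinct first elements of the tuples
firsts = set(i for i, _ in resources)

unpacked = [*firsts]       # unpack the set's elements into a list display
converted = list(firsts)   # the more common spelling of the same conversion

# Both iterate the same set object, so they produce equal lists
same_result = unpacked == converted
print(same_result)
```

Which spelling to use is mostly a matter of taste; list(...) is arguably easier to read for people unfamiliar with unpacking.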
