
I have a pandas DataFrame in which one column, resources, holds a list of tuples in each row. For example, take the following DataFrame:

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "resources": [[(1, 3), (1, 1), (2, 9)],
                                 [(3, 1), (3, 1), (3, 4)],
                                 [(9, 0), (2, 6), (5, 5)]]
                  })

Now I want to add the following four columns to my DataFrame:

  • A column first containing a list with the unique first elements of the tuples in resources (so basically a set of all the first elements)
  • A column second containing a list with the unique second elements of the tuples in resources (so basically a set of all the second elements)
  • A column same containing the number of tuples in resources having the same first and second element
  • A column different containing the number of tuples in resources having different first and second elements

The desired output columns would look like this:

  • first: [[1, 2], [3], [9, 2, 5]]
  • second: [[1, 3, 9], [1, 4], [0, 6, 5]]
  • same: [1, 0, 1]
  • different: [2, 3, 2]
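
To make the definitions concrete, the four values for the first row alone work out as follows in plain Python. This is only a worked example of the definitions, not a pandas solution; the variable names are illustrative, and sorting the unique values is an assumption made here to get a stable order.

```python
# Worked example for the first row only; `row` is an illustrative name.
row = [(1, 3), (1, 1), (2, 9)]

first = sorted({a for a, _ in row})   # unique first elements
second = sorted({b for _, b in row})  # unique second elements
same = sum(a == b for a, b in row)    # tuples whose two elements are equal
different = len(row) - same           # all remaining tuples

print(first, second, same, different)  # [1, 2] [1, 3, 9] 1 2
```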

How can I achieve this in the least time-consuming way? I first thought of using Series.str, but could not find enough functionality there to achieve my goal.

Peter
  • Please repeat [on topic](https://stackoverflow.com/help/on-topic) and [how to ask](https://stackoverflow.com/help/how-to-ask) from the [intro tour](https://stackoverflow.com/tour). “Show me how to solve this coding problem” is not a Stack Overflow issue. We expect you to make an honest attempt, and *then* ask a *specific* question about your algorithm or technique. Stack Overflow is not intended to replace existing documentation and tutorials. – Prune Apr 22 '21 at 22:56
  • I mentioned in my question that I tried using Series.str, which did not work. Then I went looking through all the functionality there and could not find anything more useful. – Peter Apr 22 '21 at 23:04

1 Answer

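# Unique first / second elements per row (set iteration order is not guaranteed).
df["first"] = df["resources"].apply(lambda x: [*set(i for i, _ in x)])
df["second"] = df["resources"].apply(lambda x: [*set(i for _, i in x)])
# A tuple collapses to a one-element set exactly when both elements are equal.
df["same"] = df["resources"].apply(lambda x: sum(len(set(t)) == 1 for t in x))
df["different"] = df["resources"].apply(
    lambda x: sum(len(set(t)) > 1 for t in x)
)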

print(df)

Prints:

   id                 resources      first     second  same  different
0   1  [(1, 3), (1, 1), (2, 9)]     [1, 2]  [1, 3, 9]     1          2
1   2  [(3, 1), (3, 1), (3, 4)]        [3]     [1, 4]     0          3
2   3  [(9, 0), (2, 6), (5, 5)]  [9, 2, 5]  [0, 5, 6]     1          2
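
The order of the elements in first and second above comes from set iteration, which is an implementation detail rather than a guarantee. If a stable order matters, or if four separate passes over the column feel wasteful, one possible variation is to walk each row once and build all four columns together. The sketch below assumes every tuple has exactly two elements; the helper name summarize and the decision to sort the unique values are illustrative choices, not something the question requires.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "resources": [
        [(1, 3), (1, 1), (2, 9)],
        [(3, 1), (3, 1), (3, 4)],
        [(9, 0), (2, 6), (5, 5)],
    ],
})


def summarize(pairs):
    """Return (first, second, same, different) for one list of 2-tuples."""
    firsts, seconds, same = set(), set(), 0
    for a, b in pairs:
        firsts.add(a)
        seconds.add(b)
        same += a == b  # True counts as 1, False as 0
    # Every tuple is either "same" or "different", so the counts sum to len(pairs).
    return sorted(firsts), sorted(seconds), same, len(pairs) - same


# One pass over the column, then transpose the per-row results into columns.
rows = [summarize(pairs) for pairs in df["resources"]]
for name, values in zip(["first", "second", "same", "different"], zip(*rows)):
    df[name] = list(values)

print(df[["first", "second", "same", "different"]])
```

Whether this is actually faster than the four apply calls depends on the data, so it is worth timing both on a realistic sample (for example with timeit) before settling on one.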
Andrej Kesely