
I have been struggling with this for a few days now. I have read a lot online and found some similar questions, such as Pandas counting occurrence of list contained in column of lists or pandas: count string criteria across down rows, but neither fully works in this case.

I have two dataframes: df1 consists of a column of strings. df2 consists of a column of lists (the lists are combinations of the strings from df1; each element within one list is unique).

I would like to know how many lists in df2 contain each combination of strings. For example: how many lists have "a" and "b" as elements? How many have "a" and "c"? And so forth.

This is what df1 looks like (simplified):

df1 = pd.DataFrame({"subject": ["a", "b", "c"]})

df1
    subject
0   a
1   b
2   c

This is what df2 looks like (simplified):

df2 = pd.DataFrame({"subject_list": [["a", "b" ,"c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})

df2

     subject_list
0    [a, b, c]
1    [b, c]
2    [a, b]
3    [b, c]
4    [c]

I have two pieces of code that both run but aren't quite right:

This code looks for the combination of two rows in df1 (as wanted). However, since df1 includes more rows than df2, the apply stops at the last row of df2, leaving some string combinations untested.

df1["combination_0"] = df2["subject_list"].apply(lambda x: x.count(x and df1.subject[0]))

This code counts the occurrences of one combination. However, I can't figure out how to change it so that it does this for each value combination.

df1["list_a_b"] = df2["subject_list"].apply(lambda x: x.count(df1.subject[0] and df1.subject[1]))
df1.list_a_b.sum()
Hannah
  • What is your expected output? Could you also please include the outputs for `df` and `df2` in your question (for readability's sake)? – cs95 Jun 11 '18 at 16:21
  • There are some errors in your example. Please try running the example code you provided. The first line should be `df1 = pd.DataFrame({"subject": ["a", "b", "c"]})` but the others have issues too. – Zev Jun 11 '18 at 16:24
  • This question is on the right track in that it is great to see a couple things you've tried but I'm not sure what you are looking for and your samples have simple mistakes in them. – Zev Jun 11 '18 at 16:36
  • Sorry for the messy design and coding in the first place! – Hannah Jun 11 '18 at 16:39

5 Answers


Here is the solution I attempted.

Starting with the two dataframes that you have, you can use itertools to get all the possible combinations of the elements of df1 two by two:

import itertools

df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame({"subject_list": [["a", "b", "c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})

# Create a new dataframe with one column that has the possible two by two combinations from `df1`
df_combinations = pd.DataFrame({'combination': list(itertools.combinations(df1.subject, 2))})

Then loop through the new dataframe, df_combinations in this case, to find out how many times each combination occurs in df2:

for index, row in df_combinations.iterrows():
    matches = df2["subject_list"].apply(lambda x: all(i in x for i in row["combination"]))
    df_combinations.at[index, "number of occurrences"] = matches.sum()

The main difference from your original solution is that I use `all` instead of `x.count`, since `all` guarantees that a row is counted only when every value of the combination is present.
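For context on why the original `x.count(...)` approach could not work: in Python, `and` between two non-empty strings evaluates to the second operand, so the original lambda only ever counted one of the two values. A quick demonstration:

```python
# "and" between two truthy values evaluates to the second operand,
# so the original expression never represents the pair at all.
print("a" and "b")  # -> b

subject_list = ["a", "b", "c"]
# This counts occurrences of "b" only, not rows containing both "a" and "b".
print(subject_list.count("a" and "b"))  # -> 1
```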

Finally df_combinations is:

  combination  number of occurrences
0      (a, b)                    2.0
1      (a, c)                    1.0
2      (b, c)                    3.0
Ana Cruz

This problem is somewhat difficult because, depending on how many values you have, there can be a lot of pair-wise comparisons. I think you may want to create a dummy DataFrame with an indicator column for each value; then you can use `.all` to easily query whatever pair-wise combination you want. It also generalizes easily if you later want combinations of any number of elements.

First create df_dummy, which indicates whether each value is contained in the list:

df_dummy = df2.subject_list.str.join(sep='?').str.get_dummies(sep='?')
#   a  b  c
#0  1  1  1
#1  0  1  1
#2  1  1  0
#3  0  1  1
#4  0  0  1

Then create the list of all pair-wise combinations you need to make (ignoring order):

vals = df1.subject.unique()
combos = list((vals[j], vals[i]) for i in range(len(vals)) for j in range(len(vals)) if i>j)
print(combos)
#[('a', 'b'), ('a', 'c'), ('b', 'c')]

Now check for all pair-wise combinations:

for x, y in combos:
    df2[x+'_and_'+y]=df_dummy[[x, y]].all(axis=1)

df2 is:

  subject_list  a_and_b  a_and_c  b_and_c
0    [a, b, c]     True     True     True
1       [b, c]    False    False     True
2       [a, b]     True    False    False
3       [b, c]    False    False     True
4          [c]    False    False    False

If you want the totals, just use `sum`, ignoring the first column:

df2[df2.columns[1:]].sum()
#a_and_b    2
#a_and_c    1
#b_and_c    3
#dtype: int64
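As noted above, the dummy frame generalizes beyond pairs: passing more columns to `.all` counts larger combinations the same way. A sketch, assuming the df2 from the question:

```python
import pandas as pd
from itertools import combinations

df2 = pd.DataFrame({"subject_list": [["a", "b", "c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})

# Same indicator trick: one 0/1 column per value
df_dummy = df2.subject_list.str.join(sep='?').str.get_dummies(sep='?')

# Count every 3-element combination by selecting three columns at once
for combo in combinations(['a', 'b', 'c'], 3):
    count = df_dummy[list(combo)].all(axis=1).sum()
    print('_and_'.join(combo), count)  # a_and_b_and_c 1
```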
ALollz

Here is my attempt to solve your problem.

There are two main steps:

  • generate all the possible lists to check from the values of df1
  • count how many rows in df2 contains each combination

Code:

import itertools

def all_in(elements, a_list):
    # Check if all values in the list elements are present in a_list
    return all(el in a_list for el in elements)

# All the (unique) values in df1
all_values = sorted(df1["subject"].unique())

result = pd.Series(dtype=int)

# For each sequence length (1, 2, 3)
for length in range(1, len(all_values) + 1):
    # For each sequence of fixed length
    for comb in itertools.combinations(all_values, length):
        # Count how many rows of df2 contain the sequence
        result["_".join(comb)] = df2.squeeze().apply(lambda x: all_in(comb, x)).sum()

which gives:

result

a        2
b        4
c        4
a_b      2
a_c      1
b_c      3
a_b_c    1

Depending on the size of the actual data and on your requirements, you could make things smarter. For example, if you know that 'a' is not in a row, you can automatically assign False to any combination including 'a'.
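One way to realize that shortcut, sketched here assuming the df1/df2 from the question: converting each row to a set once lets a single subset test rule out any combination with a missing element, instead of checking element by element.

```python
import itertools
import pandas as pd

df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame({"subject_list": [["a", "b", "c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})

# Convert each row to a set once; a combination occurs in a row
# exactly when it is a subset of that row's set.
row_sets = [set(lst) for lst in df2["subject_list"]]

counts = {}
for length in range(1, len(df1) + 1):
    for comb in itertools.combinations(df1["subject"], length):
        comb_set = set(comb)
        # comb_set <= s is False as soon as any element is missing,
        # which is the "assign False automatically" shortcut above.
        counts["_".join(comb)] = sum(comb_set <= s for s in row_sets)

print(counts)
# {'a': 2, 'b': 4, 'c': 4, 'a_b': 2, 'a_c': 1, 'b_c': 3, 'a_b_c': 1}
```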

FLab

Here is a non-Pandas solution using collections.defaultdict and itertools.combinations. There are 2 parts to the logic:

  1. Calculate all combinations from df1['subject'].
  2. Iterate df2['subject_list'] and increment dictionary counts.

frozenset is used purposely: frozensets are hashable, so they can serve as dictionary keys, and they reflect, as in your question, that order is not relevant.

from collections import defaultdict
from itertools import combinations

df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame({"subject_list": [["a", "b" ,"c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})

# calculate all combinations
combs = (frozenset(c) for i in range(1, len(df1.index)+1)
         for c in combinations(df1['subject'], i))

# initialise defaultdict
d = defaultdict(int)

# iterate combinations and lists
for comb in combs:
    for lst in df2['subject_list']:
        if set(lst) >= comb:
            d[comb] += 1

print(d)

defaultdict(int,
            {frozenset({'a'}): 2,
             frozenset({'b'}): 4,
             frozenset({'c'}): 4,
             frozenset({'a', 'b'}): 2,
             frozenset({'a', 'c'}): 1,
             frozenset({'b', 'c'}): 3,
             frozenset({'a', 'b', 'c'}): 1})
jpp

Here is yet another approach. The two main insights are as follows:

  1. We can start by intersecting each list in df2 with values of df1. This way we can avoid considering redundant subsets of each row of df2.

  2. After step 1, df2 may contain duplicated sets. Collapsing the duplicates may speed up the remaining computation.

The remaining task is to consider every subset of df1 and count the number of occurrences.


import pandas as pd
import numpy as np
from itertools import combinations
from collections import Counter

df1 = pd.DataFrame({"subject": ["a", "b", "c"]})

df2 = pd.DataFrame(
    {
        "subject_list": [
            ["a", "b", "c", "x", "y", "z", "1", "2", "3"],
            ["b", "c"],
            ["a", "b"],
            ["b", "c"],
            ["c"],
        ]
    }
)

s1 = set(df1.subject.values)


def all_combs(xs):
    for k in range(1, len(xs) + 1):
        yield from combinations(xs, k)


def count_combs(xs):
    return Counter(all_combs(xs))


res = (
    df2.subject_list.apply(s1.intersection)
    .apply(frozenset)
    .value_counts()
    .reset_index()
)

# (b, c)       2
# (c, b, a)    1
# (c)          1
# (b, a)       1

res2 = res["index"].apply(df1.subject.isin).mul(res.subject_list, axis=0)
res2.columns = df1.subject

# subject  a  b  c
# 0        0  2  2
# 1        1  1  1
# 2        0  0  1
# 3        1  1  0

res3 = pd.Series(
    {
        "_".join(comb): res2[comb][(res2[comb] > 0).all(1)].sum(0).iloc[0]
        for comb in map(list, all_combs(df1.subject.values))
    }
)


# a        2
# b        4
# c        4
# a_b      2
# a_c      1
# b_c      3
# a_b_c    1
# dtype: int64
hilberts_drinking_problem