
First, I have an Excel file (or CSV file) which I have converted to a dataframe (df).

Next, there is one master list of strings in the first column, which contains alphanumeric characters.

Then, there are lists of strings in subsequent columns, which can be the same length (list1), shorter (list2), or longer (list3).

I would like to count (or sum) the number of exact matches between master_list and the other lists.


import pandas as pd

# Assign spreadsheet filename to `file`
file = "list_match.xlsx"

# Load spreadsheet into a dataframe
df = pd.read_excel(file)

print(df)
master_list list1   list2   list3
abc         abc     abc     stu
def         xxx     def     zzz
ghi         xxx     yyy     zzz
jkl         xxx     yyy     zzz
mno1        xxx     yz1     zzz
pqr         xxx             zzz
stu         xxx             zzz
vwx         xxx             zzz
yz1         xxx             zzz
yz2         xx1             zzz
yz3         xx1             zzz
                            zzz
                            mno1
                            zzz
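
For anyone without the spreadsheet, here is a minimal sketch that rebuilds roughly the same dataframe in code. It assumes read_excel turns the blank cells into NaN, padding the shorter columns to the length of the longest one (list3); the pad helper is just for illustration.

import numpy as np
import pandas as pd

# Columns as they appear in the spreadsheet; the shorter lists are padded
# with NaN so every column has the length of the longest one (assumption
# about how read_excel handles the blank cells).
master_list = ['abc', 'def', 'ghi', 'jkl', 'mno1', 'pqr',
               'stu', 'vwx', 'yz1', 'yz2', 'yz3']
list1 = ['abc'] + ['xxx'] * 8 + ['xx1'] * 2
list2 = ['abc', 'def', 'yyy', 'yyy', 'yz1']
list3 = ['stu'] + ['zzz'] * 11 + ['mno1', 'zzz']

def pad(values, length):
    # Pad a column with NaN up to the target length.
    return values + [np.nan] * (length - len(values))

n = max(len(c) for c in (master_list, list1, list2, list3))
df = pd.DataFrame({'master_list': pad(master_list, n),
                   'list1': pad(list1, n),
                   'list2': pad(list2, n),
                   'list3': pad(list3, n)})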

The objective is to create this result:

List     Count
list1    1
list2    3
list3    2

Explanation of result:

  • There is one match in list1: 'abc'
  • There are three matches in list2: 'abc', 'def', 'yz1'
  • There are two matches in list3: 'stu', 'mno1'

My question is similar to this question, except the data is transposed and there are multiple lists to compare, which might require a loop.

user2314737
milaske

4 Answers


You could use df.isin after replacing empty strings with NaN

df.replace(r'^\s*$', 'NaN', regex=True).isin(df.master_list.values).sum()

This counts all the occurrences of elements of df that are in the column df.master_list. Note that sum() by default sums along axis 0 (column-wise) and by default null/NaN values are skipped, so this is the same as:

df.replace(r'^\s*$', 'NaN', regex=True).isin(df.master_list.values).sum(axis=0, skipna=True)

Example:

>>> df
  list1 list2 master_list
0   abc   stu         abc
1   yyy   xxx         def
2         xxx
3         xxx
4         xxx
>>> df.replace(r'^\s*$', 'NaN', regex=True).isin(df.master_list.values).sum()
list1          1
list2          0
master_list    2
dtype: int64
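
One caveat: if master_list itself contains NaN padding (as it does when the whole sheet is read with read_excel), those NaNs end up among the lookup values and can inflate the counts, which appears to be what the comment thread below runs into. A hedged sketch of a variant that drops the padding from the lookup first (an adjustment, not part of the original answer):

import numpy as np

# Replace blank strings with real NaN and drop NaN from the lookup values,
# so the padding rows are never counted as matches.
lookup = df['master_list'].dropna().values
df.replace(r'^\s*$', np.nan, regex=True).isin(lookup).sum()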
user2314737
  • this will match the index first and then the value. – BENY Dec 01 '18 at 20:39
  • Unfortunately, this does not provide the correct solution. `master_list 14, list1 4, list2 12, list3 2` ... I think the issue is that your solution works perfectly when there are no blanks (NaN) and the columns are the same length... `master_list` has 11 rows but the result above `master_list 14` makes me think it's reflecting the length of the longest list `list3`. – milaske Dec 01 '18 at 21:06
  • I get the same results when running this code. Perhaps the difference is because the dataframe in my example is created from an excel file? – milaske Dec 01 '18 at 22:07
  • @milaske maybe you should post the Excel file in your example – user2314737 Dec 01 '18 at 22:13
  • That is a good suggestion. I searched the Help Center and "How do I ask a good question?" to see if posting the Excel file was possible. Unfortunately, I wasn't able to find out how to do that. As the other answers to this question work correctly, it may not be necessary. – milaske Dec 01 '18 at 22:39
  • @milaske In your question it's not clear what the empty fields in the dataframe are (NaNs or empty strings?) – user2314737 Dec 01 '18 at 22:51
  • Excellent observation. You are right, I was not clear. In the excel file, they are empty fields (i.e., blank). Then when I create a `dataframe`, the empty fields change to `NaN` which are displayed after the last string in each column down to the length of the longest column in the dataframe (`list3` in my example). – milaske Dec 01 '18 at 23:00

With this dataframe:

   master_list list1 list2 list3
0          abc   abc   abc   stu
1          def   xxx   def   zzz
2          ghi   xxx   yyy   zzz
3          jkl   xxx   yyy   zzz
4         mno1   xxx   yz1   zzz
5          pqr   xxx   NaN   zzz
6          stu   xxx   NaN   zzz
7          vwx   xxx   NaN   zzz
8          yz1   xxx   NaN   zzz
9          yz2   xx1   NaN   zzz
10         yz3   xx1   NaN   zzz
11         NaN   NaN   NaN   zzz
12         NaN   NaN   NaN  mno1
13         NaN   NaN   NaN   zzz

Use df.apply and np.isin:

import numpy as np

df.apply(lambda col: np.isin(col, df.master_list).sum())

or df.isin:

df.isin(df.master_list.dropna().tolist()).sum()

result:

master_list    11
list1           1
list2           3
list3           2
dtype: int64
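
The two variants behave quite differently on large frames (see the timing figures in the comments below). A small, hedged harness for reproducing that kind of comparison; the frame here is synthetic and its size and column names are illustrative only, not the OP's real data:

import time

import numpy as np
import pandas as pd

# Synthetic frame roughly shaped like the OP's (10 columns, many rows).
rng = np.random.default_rng(0)
words = np.array(['w%d' % i for i in range(5000)])
big = pd.DataFrame({'master_list': rng.choice(words, 100_000),
                    **{'list%d' % i: rng.choice(words, 100_000)
                       for i in range(1, 10)}})

start = time.perf_counter()
big.apply(lambda col: np.isin(col, big.master_list).sum())
print('apply + np.isin:', time.perf_counter() - start)

start = time.perf_counter()
big.isin(big.master_list.dropna().tolist()).sum()
print('df.isin:', time.perf_counter() - start)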
B. M.
  • For additional insight, one of these solutions is preferred on larger dataframes. The dataframe that I need actual help with is much larger -- 10 columns and thousands of rows... Note the differences in time required to run both pieces of code on the larger dataframe. Using `df.apply` and `np.isin` takes 55.95 seconds. Using `df.isin` takes 0.83 seconds. – milaske Dec 01 '18 at 22:41
  • Yep, I knew that. df.isin uses fast indexing techniques. Beautiful explanation here: https://stackoverflow.com/questions/50779617/pandas-pd-series-isin-performance-with-set-versus-array – B. M. Dec 02 '18 at 10:26

Create a list of the non-null values of master_list and use DataFrame.isin:

master_list = df.loc[df['master_list'].notnull(), 'master_list'].tolist()
df.iloc[:, 1:].isin(master_list).sum()



list1    1
list2    3
list3    2
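
If the exact List/Count layout from the question is needed, the resulting Series can be reshaped into a two-column frame; a small sketch, where the 'List' and 'Count' labels are simply the names used in the question:

# Turn the per-column counts into the List/Count table from the question.
counts = df.iloc[:, 1:].isin(master_list).sum()
result = counts.rename_axis('List').reset_index(name='Count')
# result has two columns, List and Count, with counts 1, 3 and 2.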
Vaishali
  • Unfortunately, this does not provide the correct solution. `list1 4, list2 12, list3 2` – milaske Dec 01 '18 at 20:54
  • This solution works. I am going to accept and give the check mark to a different answer because it only uses one line of code instead of two, but this code can certainly be used. – milaske Dec 01 '18 at 22:19
  • For additional insight, the dataframe that I need actual help with is much larger -- 10 columns and thousands of rows. Your code takes 0.90 seconds to run on the larger dataframe, barely longer than the code provided by @B.M. – milaske Dec 01 '18 at 22:30

Using numpy broadcasting:

import numpy as np

s1 = df.iloc[:, 1:].values
s2 = df['master_list'].values[:, None]
np.sum(s1 == s2[:, None], 1).sum(0)
Out[409]: array([1, 3, 2])
BENY
  • This code works on the dataframe in my example. The dataframe that I need actual help with is much larger -- 10 columns and thousands of rows. When I attempt to run your code on the larger dataframe, I receive the following error: `numpy.core._internal.AxisError: axis 1 is out of bounds for array of dimension 0` – milaske Dec 01 '18 at 22:22
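
One possible reading of that AxisError is that the elementwise == comparison collapsed to a scalar on the larger, object-dtype frame; whatever the cause, here is a hedged sketch that sidesteps the full row-by-row broadcast by letting np.isin do the membership test (a different technique from the broadcast above):

import numpy as np

# Membership test over the whole block at once; NaN padding never matches,
# and no N x N x K boolean cube is built.
lookup = df['master_list'].dropna().values
np.isin(df.iloc[:, 1:].values, lookup).sum(axis=0)
# expected on the example data: array([1, 3, 2])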