I have a DataFrame to which I'd like to add a column "exists", based on whether each item exists in another DataFrame.

Using the isin function only returns 1 match against that other DataFrame. The same happens with a loc filter when I set the column I want to filter on as the index.

It doesn't work as expected when I reference a list or a column of another DataFrame like this:

table.loc[table.index.isin(tableOther['column']), : ]

In this case it only returns 1 item.

import pandas as pd
import numpy as np

# Source that I'd like to enrich with an additional column
table = pd.read_csv('keywordsDataSource.csv', encoding='utf-8', delimiter=';', index_col='Keyword') 

# Source to compare keywords against 
tableSubject = pd.read_csv('subjectDataSource.csv', encoding='utf-8', names=["subjects"])

### This column based check only returns 1 - seemingly random - match ### 
table.loc[table.index.isin(tableSubject['subjects']), : ]


I also tried it like this:

# Source that I'd like to enrich with an additional column
table = pd.read_csv('keywordsDataSource.csv', encoding='utf-8', delimiter=';') 

# Source to compare keywords against 
tableSubject = pd.read_csv('subjectDataSource.csv', encoding='utf-8', names=["subjects"])

mask = table['Keyword'].isin(tableSubject.subjects)
table[mask]


I've also tried using .query, and converting the subjects column to a list; both end with the same single match as above.

As the output is the same in every attempt, I suspect the problem lies in the data source.

Thank you for your thoughts!

Michel K

1 Answer

The answer turned out to be as simple as the capitalization of words. The two data sources did not use the same casing: one list had Capitalized Words Like This, and the other was capitalized inconsistently.

Lesson learned: make sure the values in both columns are formatted exactly the same, because all of these matching options look for exact matches.

This can be done as follows:

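# Lowercase both sides so that the exact-match comparison lines up
table['Keyword'] = table['Keyword'].str.lower()
tableSubject['subjects'] = tableSubject['subjects'].str.lower()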
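
To tie this back to the original goal of an "exists" column, here is a minimal sketch. The two small DataFrames are made-up stand-ins for the CSV files, and the extra .str.strip() call is my own assumption in case stray whitespace is also a problem. It compares normalized copies, so the original casing in both tables is left untouched:

```python
import pandas as pd

# Hypothetical stand-in data for keywordsDataSource.csv and subjectDataSource.csv
table = pd.DataFrame({"Keyword": ["Machine Learning", "data science", "Pandas "]})
tableSubject = pd.DataFrame({"subjects": ["machine learning", "Data Science", "numpy"]})

# Normalize both sides: trim whitespace (an assumption) and lowercase
keywords_norm = table["Keyword"].str.strip().str.lower()
subjects_norm = tableSubject["subjects"].str.strip().str.lower()

# isin does exact matching, so it is applied to the normalized values
table["exists"] = keywords_norm.isin(subjects_norm)

print(table)
```

With this sample data, the first two keywords match despite the different capitalization, and "Pandas " does not because it is not in the subjects list.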

I also found a great answer here, in case you don't need an exact match:

How to test if a string contains one of the substrings in a list, in pandas?
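
For reference, one common way to do that kind of substring match is to join the subjects into a single regex pattern and pass it to str.contains. This is only a sketch with made-up sample data, not necessarily the exact approach from the linked answer; re.escape is used so that subjects containing regex characters are matched literally:

```python
import re
import pandas as pd

# Hypothetical sample data, not taken from the original CSV files
table = pd.DataFrame({"Keyword": ["cheap running shoes", "garden tools", "Running Tips"]})
subjects = ["running", "cooking"]

# Build one alternation pattern, escaping any regex special characters
pattern = "|".join(re.escape(s) for s in subjects)

# case=False makes the substring match ignore capitalization;
# na=False treats missing keywords as "no match"
table["exists"] = table["Keyword"].str.contains(pattern, case=False, regex=True, na=False)

print(table)
```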

Michel K