How to speed up the matching of columns in pandas dataframe

Question

I'm trying to find matching values in a pandas dataframe. Once a match is found I want to perform some operations on the row of the dataframe.

Currently I'm using this code:

import pandas as pd

d = {'child_id': [1, 2, 5, 4], 'parent_id': [3, 4, 2, 3], 'content': ["a", "b", "c", "d"]}

df = pd.DataFrame(data=d)

for i in range(len(df)):
    for j in range(len(df)):
        if str(df['child_id'][j]) == str(df['parent_id'][i]):
            print(df.content[i])

It works fine, but it is rather slow. Since I'm dealing with a dataset with millions of rows, it would take months. Is there a faster way to do this?

Edit: To clarify, what I want to create is a dataframe that contains the content of the matches.

import pandas as pd

d = {'child_id': [1,2,5,4],
 'parent_id': [3,4,2,3],
 'content': ["a","b","c","d"]}

df = pd.DataFrame(data=d)

# Collect the matches in a list; DataFrame.append was removed in pandas 2.0
rows = []

for i in range(len(df)):
    for j in range(len(df)):
        if str(df['child_id'][j]) == str(df['parent_id'][i]):
            content_child = str(df["content"][i])
            content_parent = str(df["content"][j])
            rows.append({"content_child": content_child, "content_parent": content_parent})

df2 = pd.DataFrame(rows, columns=["content_child", "content_parent"])

print(df2)

What are the "some operations" you are referring to? What are you trying to do when you find a match? — Erfan, Mar 30 '19 at 18:06
Do you mean `df.loc[df.parent_id.isin(df.child_id),'content']`? If not, can you explain what exactly you are trying to do, together with the final expected dataframe? Maybe loops are not required for this. — anky, Mar 30 '19 at 18:06
I want to extract the value of the column "Content" of row i and row j, if a match is found. — zacha2, Mar 30 '19 at 18:09
@zacha2 So is the match in any row or in the same row? If it is the same row, I think you need `df.loc[df.parent_id.eq(df.child_id),'content']`, otherwise see my previous comment. — anky, Mar 30 '19 at 18:10
@anky_91 That works, but I also need the content from the row with the matching child_id. I edited the original post to clarify what I want to achieve. — zacha2, Mar 30 '19 at 18:39
@zacha2 `df.loc[df.parent_id.eq(df.child_id),['child_id','content']]`? If `eq` does not fit, use `isin` instead. — anky, Mar 30 '19 at 18:47
With `isin` it works, but this only returns the content from child_id, not from the corresponding parent_id. @anky_91 — zacha2, Mar 30 '19 at 19:06
@zacha2 If possible, post another question with all the code you have tried and the relevant data. — anky, Mar 30 '19 at 19:07
use `join operation` https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html — Gor, Mar 30 '19 at 19:37
@jezrael This is not a duplicate of the question you marked. Please remove the mark. — Gor, Mar 30 '19 at 20:02
@Gor - A new [question](https://stackoverflow.com/questions/55435139/matching-and-extracting-values-from-pandas-dataframe) has been created, feel free to answer there. — jezrael, Mar 30 '19 at 20:03
@jezrael OK, but can you also remove the duplicate mark from this question? This way we are making trash on stackoverflow. — Gor, Mar 30 '19 at 20:06
@Gor - In my opinion, if `isin` does not work then `join` does not work either, so your solution is also bad. If I remove the dupe, should I then mark the new question as a dupe? Or what do you think is best? — jezrael, Mar 30 '19 at 20:09
@jezrael The duplicate marking is not associated with my answer or with any answer. The best solution is to first make sure that a question is a duplicate, and then mark it. I think everyone understands that you want to earn more reputation this way, but please do not hurry and do not make wrong decisions. — Gor, Mar 30 '19 at 20:14
@Gor - If the new question had not been created, I would reopen this one. But if this one is reopened, then the new question has to be closed. So in my opinion it is better to change nothing. But maybe I am wrong. Can you explain why it is better to have 2 questions with the same content if this question is reopened? — jezrael, Mar 30 '19 at 20:24
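
The `isin` and `join` suggestions in the comments above point toward a self-merge: match each row's `parent_id` against the `child_id` column and keep both contents. The following is only a minimal sketch of that idea, not a solution confirmed in this thread. It uses `DataFrame.merge` instead of `DataFrame.join` because `merge` can match on differently named columns, it assumes both id columns have the same dtype, and the names `children`, `parents`, and `pairs` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'child_id': [1, 2, 5, 4],
    'parent_id': [3, 4, 2, 3],
    'content': ["a", "b", "c", "d"],
})

# Left side: each row's parent_id together with its own content
children = df[['parent_id', 'content']].rename(columns={'content': 'content_child'})

# Right side: the same rows, looked up by child_id
parents = df[['child_id', 'content']].rename(columns={'content': 'content_parent'})

# Inner merge keeps only rows whose parent_id occurs in child_id
pairs = children.merge(parents, left_on='parent_id', right_on='child_id', how='inner')
pairs = pairs[['content_child', 'content_parent']]

print(pairs)
```

For the sample data this yields the pairs (b, d) and (c, b), the same pairs the nested loops in the question produce, without comparing every row against every other row in Python.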

MartinKondor · Answer 1 · 2019-03-30T18:41:53.163

A fast way is to use NumPy's vectorized comparisons:

import numpy as np
import pandas as pd


d = {
  'child_id': [1, 2, 5, 4],
  'parent_id': [3, 4, 2, 3],
  'content': ["a", "b", "c", "d"]
}
df = pd.DataFrame(data=d)

# Elementwise comparisons: same position, then each column against the other one reversed
comp1 = df['child_id'].values == df['parent_id'].values
comp2 = df['child_id'].values[::-1] == df['parent_id'].values
comp3 = df['child_id'].values == df['parent_id'].values[::-1]

# Start from an all-False mask so that comp is defined when no branch applies
comp = np.zeros(len(df), dtype=bool)
if comp1.any() and not comp2.any() and not comp3.any():
  comp = np.c_[ df['content'].values[comp1] ]
elif comp1.any() and comp2.any() and not comp3.any():
  comp = np.c_[ df['content'].values[comp1], df['content'].values[comp2] ]
elif comp1.any() and comp2.any() and comp3.any():
  comp = np.c_[ df['content'].values[comp1], df['content'].values[comp2], df['content'].values[comp3] ]

print( df['content'].values[comp] )

Which outputs:

[]

The output should be [b,c]. Do you mean `comp = df['child_id'].values == df['parent_id'].values` followed by `print( df['content'].values[comp] )`? This returns []. — zacha2, Mar 30 '19 at 18:22
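
The elementwise `==` in the answer above only compares values that sit at the same row position, which is why nothing is found for this data. A minimal sketch of a membership test instead, assuming the goal is the one stated in the comment (the content of every row whose `parent_id` occurs anywhere in `child_id`). The names `mask` and `matched` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'child_id': [1, 2, 5, 4],
    'parent_id': [3, 4, 2, 3],
    'content': ["a", "b", "c", "d"],
})

# True where a row's parent_id appears anywhere in the child_id column
mask = np.isin(df['parent_id'].to_numpy(), df['child_id'].to_numpy())

# Select the content of the matching rows
matched = df['content'].to_numpy()[mask]

print(matched)  # ['b' 'c']
```

This is the NumPy counterpart of the `df.parent_id.isin(df.child_id)` suggestion from the comments on the question. It returns only one side of each match, not the content of the corresponding parent row.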
