
I have a pandas dataframe as given below:

                 vname               gname
0       Vishu Adhikari      Haren Adhikari
1            Viswa Roy           Galen Roy
2            Vishu Ray           Deven Ray
3           Vasavi Ray          Yogesh Ray
4             Vina Ray           Suren Ray
...                ...                 ...
226498       Umesh Ray       Bhimachar Ray
226499     Umapada Roy   Umesh Chandra Roy
226500        Upen Ray       Bholanath Ray
226501       UTTAM ROY  HARISH CHANDRA ROY
226502        USHA ROY     CHHATRAPATI ROY

My goal is to categorise each row as H or M and find all the names in the H category. Definition of H: all names that contain the word "roy" or have "roy" as part of the name, plus all names related to them (we can ignore category M for now). Example: in "Galen Roy", "Galen" is categorised as H. Every name in which "Galen" appears (in the df['names'] column) must then also be categorised as H, and so on.

This is what I tried:

df.vname = df.vname.str.lower()
df.gname = df.gname.str.lower()
df['names'] = df['vname'] + ' ' + df['gname']
names_h = ['roy']

#get all names related to roy recursively

def get_relnames(df,name,names_rel):
    
    for i,row in df.iterrows():
        names = row['names'].split(' ') #gives list of words
        if name in names:
            for x in names:
                if x not in names_rel and len(x) >2 and ')' not in x:
                    names_rel.append(x)
                    print(len(names_rel))
                    get_relnames(df,x,names_rel)

    print(names_rel)
    print(len(names_rel))

get_relnames(df,'roy',names_h)

But it is too slow and after an hour or so throws the error "RecursionError: maximum recursion depth exceeded in comparison". What is the best way to accomplish this? Any help would be greatly appreciated.

EDIT: Sample for 20 rows of dataframe.

names_h = ['roy']

                  vname                 gname                                    names
0        vishu adhikari        haren adhikari            vishu adhikari haren adhikari
1             viswa roy             galen roy                      viswa roy galen roy
2             vishu roy             deven roy                      vishu roy deven roy
3            vasavi ray            yogesh ray                    vasavi ray yogesh ray
4              vina ray             suren ray                       vina ray suren ray
5     vimalkumar barman   rajendr nath barman    vimalkumar barman rajendr nath barman
6         vaishakhi ray             jiten ray                  vaishakhi ray jiten ray
7   vishma dev adhikary     haripada adhikary    vishma dev adhikary haripada adhikary
8             vishu ray           lakhiya ray                    vishu ray lakhiya ray
9             vivek roy         lalit ch. roy                  vivek roy lalit ch. roy
10     vibhas singh ray    niranjan simah ray      vibhas singh ray niranjan simah ray
11   vijayakumar sarkar  mahesh chandr sarkar  vijayakumar sarkar mahesh chandr sarkar
12            vishu ray          shrikant ray                   vishu ray shrikant ray
13           vihsma roy           bishadu roy                   vihsma roy bishadu roy
14          vaswati roy            sirish roy                   vaswati roy sirish roy
15       vishu adhikari       gotalu adhikari           vishu adhikari gotalu adhikari
16     vishmadeb barman         bhaben barman           vishmadeb barman bhaben barman
17          vina barman         ramesh barman                vina barman ramesh barman
18            vishu ray      hemendranath ray               vishu ray hemendranath ray
19            vishu das            haleya das                     vishu das haleya das

Output:

names_h = ['roy','vishu','deven','adhikari','haren','viswa','galen','vivek','lalit','ray','shrikant','vihsma','bishadu','vaswati','sirish','gotalu','hemendranath','das','haleya','vasavi','yogesh','vina','suren','lakhiya','barman','bhaben','vishmadeb','vaswati','ramesh','rajendr','nath']

vishu is in the list because it is related to roy (in the names column of the 3rd row); hemendranath is in the list because it is related to vishu (in the last but one row). I need to find all such related words recursively for the entire dataframe of 220k+ rows.

My code works fine and gives the expected output for a smaller dataframe, but for the large one it takes hours and finally crashes with the error.

Naveed
  • The list names_h should be appended with all the names (words) where "roy" is related. In "galen roy", galen will be appended to name_h. And then all the names where galen appears like galen abc, galen xyz, then names_h will look like ['roy','galen','xyz','abc']. This should go on recursively till all the related names (words) are appended to names_h – Naveed Jun 17 '21 at 14:11
  • @Naveed please put your exact expected output in your question – tomjn Jun 17 '21 at 14:16
  • My dataframe has 220K+ rows. The expected output will run into a list of thousands of words; it's not possible to produce it manually. – Naveed Jun 17 '21 at 14:17
  • @Naveed Then do it for 20 rows ([and read this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)) – tomjn Jun 17 '21 at 14:18
  • Added sample output for 20 rows – Naveed Jun 17 '21 at 14:31
  • @Corralien Sorry I missed the word "deven", added it now. – Naveed Jun 17 '21 at 15:35
  • @Corralien ray is there already. Also I have added all the missing words – Naveed Jun 17 '21 at 15:38
  • What about "nath" and "rajendr"? I see that in index 5 they appear with "barman" (which is included). What am I missing? – tomjn Jun 17 '21 at 15:48
  • @tomjn Extremely sorry, I have added those 2 words too now. You have correctly understood the problem. – Naveed Jun 17 '21 at 15:54
  • Ok - I also see "jiten", "simah", "niranjan", "vibhas", "vaishakhi" and "singh" appearing with "ray" and "vimalkumar" with "barman". What do you want to do with "ch.", which appears with "roy"? – tomjn Jun 17 '21 at 15:58
  • "ch" will have to be ignored because it has less than 3 characters. Only words with 3 or more characters count. Please overlook any words I may have missed in the expected output. – Naveed Jun 17 '21 at 16:00
  • "ch" cannot be ignored because "ch." has 3 characters. – Corralien Jun 17 '21 at 16:04
  • @Corralien dot and other non alphabetic characters will be removed from the strings in actual program. – Naveed Jun 17 '21 at 16:06

2 Answers

You can use networkx.

Build all pairs of names within each row (itertools.combinations); these pairs are the edges of the graph. This gives a list of lists of tuples, so flatten it into a single sequence of tuples (itertools.chain.from_iterable) before adding the edges. Then get all nodes connected to the node 'roy' with node_connected_component.

import itertools
import networkx as nx
from networkx.algorithms.components import node_connected_component

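# Keep words of 3+ characters, then pair up every two words in a row (the edges).
edges = df['names'].str.findall(r'\w{3,}') \
                   .apply(lambda x: list(itertools.combinations(x, 2)))
G = nx.Graph()
# Flatten the per-row edge lists into one stream of (word, word) tuples.
G.add_edges_from(itertools.chain.from_iterable(edges))
# Set of all words reachable from 'roy'.
names_h = node_connected_component(G, 'roy')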
>>> names_h
{'adhikari',
 'barman',
 'bhaben',
 'bishadu',
 'das',
 'deven',
 'galen',
 'gotalu',
 'haleya',
 'haren',
 'hemendranath',
 'jiten',
 'lakhiya',
 'lalit',
 'nath',
 'niranjan',
 'rajendr',
 'ramesh',
 'ray',
 'roy',
 'shrikant',
 'simah',
 'singh',
 'sirish',
 'suren',
 'vaishakhi',
 'vasavi',
 'vaswati',
 'vibhas',
 'vihsma',
 'vimalkumar',
 'vina',
 'vishmadeb',
 'vishu',
 'viswa',
 'vivek',
 'yogesh'}

Draw:

import matplotlib.pyplot as plt
plt.subplot()
nx.draw(G, with_labels=True, font_weight='bold')
plt.show()
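
To illustrate the logic behind the connected-component lookup, here is a minimal standard-library sketch of the same idea. It is not how networkx is implemented; it only shows the principle. The `rows` data, the `neighbours` mapping and the `related_words` helper are hypothetical names made up for this example. Because the search uses an explicit queue instead of recursion, it cannot hit Python's recursion limit.

```python
from collections import defaultdict, deque

# Hypothetical toy data: each inner list holds the words of one row.
rows = [
    ["viswa", "roy", "galen", "roy"],
    ["galen", "das", "haleya", "das"],
    ["vina", "barman", "ramesh", "barman"],
]

# Adjacency map: every word is linked to every other word in the same row.
neighbours = defaultdict(set)
for words in rows:
    unique = set(words)
    for w in unique:
        neighbours[w] |= unique - {w}

def related_words(start):
    """Breadth-first search: collect every word reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other in neighbours[current]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen

component = related_words("roy")
print(sorted(component))
```

Each word is visited once, so the work grows with the number of words and links rather than with repeated scans of the whole dataframe.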


Corralien
  • Amazing! In the lambda function, need to ignore words containing non alphabetic characters. Can you provide that please? – Naveed Jun 17 '21 at 16:09
  • If possible, can you please explain the logic behind this? It's super-fast for a large dataframe. – Naveed Jun 17 '21 at 16:19
  • I have observed that whatever name I input (in place of roy), the output is the same for the main dataframe of 220k rows. That shouldn't be the case. – Naveed Jun 17 '21 at 16:28

Here is another way using pandas. I'm not sure how it will scale up to 220k rows.

all_names = df["names"].str.lower().str.split().explode()

targets = ["roy"]
n_old = None
while len(targets) != n_old:   
    n_old = len(targets)
    targets = all_names[all_names.isin(targets).groupby(level=0).any()]
    targets = targets.unique()
print(targets)

The idea is to rely on pandas string methods rather than a manual loop, which will almost always be slow. I didn't remove "ch."; I'll leave that to you.
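
One possible way to handle "ch." is to clean the words before the loop starts. This is only a sketch under assumptions taken from the comments on the question: non-alphabetic characters are stripped and only words of 3 or more characters count. The four-row `df` is made-up sample data, and `words` and `row_hit` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's 'names' column.
df = pd.DataFrame({"names": [
    "viswa roy galen roy",
    "vivek roy lalit ch. roy",
    "galen das haleya das",
    "vina barman ramesh barman",
]})

# Drop every character that is not a lowercase letter or whitespace,
# then split each row into words (one word per line after explode).
words = (df["names"].str.lower()
                    .str.replace(r"[^a-z\s]", "", regex=True)
                    .str.split()
                    .explode())
# Keep only words with 3 or more characters, so "ch" is dropped.
all_names = words[words.str.len() >= 3]

targets = ["roy"]
n_old = None
while len(targets) != n_old:
    n_old = len(targets)
    # True for every row that contains at least one current target word.
    row_hit = all_names.isin(targets).groupby(level=0).any()
    # Broadcast the per-row flag back onto the exploded words.
    mask = row_hit.reindex(all_names.index).to_numpy()
    targets = all_names[mask].unique()

result = sorted(targets)
print(result)
```

The loop is otherwise the same as above; the explicit `reindex` just makes the alignment between the per-row flags and the exploded words visible.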

tomjn