I have a pandas dataframe as given below:
        vname           gname
0       Vishu Adhikari  Haren Adhikari
1       Viswa Roy       Galen Roy
2       Vishu Ray       Deven Ray
3       Vasavi Ray      Yogesh Ray
4       Vina Ray        Suren Ray
...     ...             ...
226498  Umesh Ray       Bhimachar Ray
226499  Umapada Roy     Umesh Chandra Roy
226500  Upen Ray        Bholanath Ray
226501  UTTAM ROY       HARISH CHANDRA ROY
226502  USHA ROY        CHHATRAPATI ROY
My goal is to categorize each row as H or M and find all the names in the H category. Definition of H: all names that contain the word "roy" (that is, have roy as part of the name), plus all names related to them (category M can be ignored for now). Example: in "Galen Roy", "Galen" is categorized as H. All the names in rows where "Galen" appears (in the df['names'] column) must then also be categorized as H, and so on.
This is what I tried:
df.vname = df.vname.str.lower()
df.gname = df.gname.str.lower()
df['names'] = df['vname'] + ' ' + df['gname']
names_h = ['roy']

# get all names related to roy recursively
def get_relnames(df, name, names_rel):
    for i, row in df.iterrows():
        names = row['names'].split(' ')  # gives list of words
        if name in names:
            for x in names:
                if x not in names_rel and len(x) > 2 and ')' not in x:
                    names_rel.append(x)
                    print(len(names_rel))
                    get_relnames(df, x, names_rel)
    print(names_rel)
    print(len(names_rel))

get_relnames(df, 'roy', names_h)
But it is too slow, and after an hour or so it throws the error "RecursionError: maximum recursion depth exceeded in comparison". What is the best way to accomplish this? Any help would be greatly appreciated.
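For reference, the attempt above rescans the whole dataframe with iterrows() once per discovered word, and each new word adds another level of recursion, which is consistent with both the slowness and the RecursionError. A minimal sketch of a non-recursive alternative follows, assuming the rows are available as plain strings (for example via df['names'].tolist()). The function name related_names and the word-to-rows index are illustrative choices, not part of the original code; the filter on word length and ')' is copied from the attempt above.

```python
from collections import defaultdict, deque


def related_names(rows, seed):
    """Collect every word transitively linked to `seed` through shared rows.

    `rows` is a list of strings (hypothetical input, e.g. df['names'].tolist()).
    """
    split_rows = [row.split() for row in rows]

    # Inverted index: word -> indices of the rows that contain it,
    # so the rows for a word are looked up instead of rescanned.
    index = defaultdict(list)
    for i, words in enumerate(split_rows):
        for w in set(words):
            index[w].append(i)

    found = [seed]        # result, in discovery order
    seen = {seed}         # set for fast membership checks
    seen_rows = set()     # each row only needs to be expanded once
    queue = deque([seed])

    # Worklist loop replaces the recursion, so there is no depth limit.
    while queue:
        name = queue.popleft()
        for i in index.get(name, ()):
            if i in seen_rows:
                continue
            seen_rows.add(i)
            for w in split_rows[i]:
                # Same filter as in the original attempt.
                if w not in seen and len(w) > 2 and ')' not in w:
                    seen.add(w)
                    found.append(w)
                    queue.append(w)
    return found
```

Under these assumptions, every row is split once and expanded at most once, so the work grows roughly with the total number of words rather than with rows times discovered words. With the dataframe above it could be called as names_h = related_names(df['names'].tolist(), 'roy').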
EDIT: Sample for 20 rows of dataframe.
names_h = ['roy']
    vname                gname                 names
0   vishu adhikari       haren adhikari        vishu adhikari haren adhikari
1   viswa roy            galen roy             viswa roy galen roy
2   vishu roy            deven roy             vishu roy deven roy
3   vasavi ray           yogesh ray            vasavi ray yogesh ray
4   vina ray             suren ray             vina ray suren ray
5   vimalkumar barman    rajendr nath barman   vimalkumar barman rajendr nath barman
6   vaishakhi ray        jiten ray             vaishakhi ray jiten ray
7   vishma dev adhikary  haripada adhikary     vishma dev adhikary haripada adhikary
8   vishu ray            lakhiya ray           vishu ray lakhiya ray
9   vivek roy            lalit ch. roy         vivek roy lalit ch. roy
10  vibhas singh ray     niranjan simah ray    vibhas singh ray niranjan simah ray
11  vijayakumar sarkar   mahesh chandr sarkar  vijayakumar sarkar mahesh chandr sarkar
12  vishu ray            shrikant ray          vishu ray shrikant ray
13  vihsma roy           bishadu roy           vihsma roy bishadu roy
14  vaswati roy          sirish roy            vaswati roy sirish roy
15  vishu adhikari       gotalu adhikari       vishu adhikari gotalu adhikari
16  vishmadeb barman     bhaben barman         vishmadeb barman bhaben barman
17  vina barman          ramesh barman         vina barman ramesh barman
18  vishu ray            hemendranath ray      vishu ray hemendranath ray
19  vishu das            haleya das            vishu das haleya das
Output:
names_h = ['roy','vishu','deven','adhikari','haren','viswa','galen','vivek','lalit','ray','shrikant','vihsma','bishadu','vaswati','sirish','gotalu','hemendranath','das','haleya','vasavi','yogesh','vina','suren','lakhiya','barman','bhaben','vishmadeb','vaswati','ramesh','rajendr','nath']
vishu is in the list because it is related to roy (in the names column of the 3rd row); hemendranath is in the list because it is related to vishu (in the second-to-last row). I need to find all such related words recursively for the entire dataframe of 220k+ rows.
My code works fine and gives the expected output for a smaller dataframe, but for the large one it takes hours and finally crashes with the error.
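Since "related" here means sharing a row, applied transitively, the list being sought can also be read as the connected group of words that contains 'roy'. A minimal sketch of that view using a union-find (disjoint-set) structure is shown below. The helper names find, build_groups and group_of are hypothetical, the input is assumed to be a list of row strings, and the word filter is the one from the code above. This is a different technique from the recursive attempt, not a rewrite of it.

```python
def find(parent, x):
    """Return the representative of x's group, shortening paths on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def build_groups(rows):
    """Merge all eligible words that share a row into one group."""
    parent = {}
    for row in rows:
        # Same eligibility filter as in the original attempt.
        words = [w for w in row.split() if len(w) > 2 and ')' not in w]
        for w in words:
            parent.setdefault(w, w)
        # Link every word in the row to the first one.
        for w in words[1:]:
            root_a = find(parent, words[0])
            root_b = find(parent, w)
            if root_a != root_b:
                parent[root_b] = root_a
    return parent


def group_of(parent, seed):
    """Return the set of words in the same group as `seed`."""
    if seed not in parent:
        return set()
    root = find(parent, seed)
    return {w for w in parent if find(parent, w) == root}
```

One pass over the rows builds the groups for every word at once, so the same structure could later answer the same question for a seed other than 'roy' without reprocessing the dataframe, for example group_of(build_groups(df['names'].tolist()), 'roy').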