I have a DataFrame of companies, called `slave` (`pandas.core.frame.DataFrame`):

```
Date  ID  Name  Company    Email
2018  1   AB    Apple      apple@apple.com
2017  2   BC    MicroSoft  ms@ms.com
2016  3   CD    Amazon     amz@amz.com
```

The `Company` column has dtype `object`, not string.
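For reference, a minimal construction of this sample `slave` frame (column names and values taken from the table above; my real frame has about 1K rows):

```python
import pandas as pd

# small reproducible sample of the slave frame (the real one has ~1K rows)
slave = pd.DataFrame({
    'Date': [2018, 2017, 2016],
    'ID': [1, 2, 3],
    'Name': ['AB', 'BC', 'CD'],
    'Company': ['Apple', 'MicroSoft', 'Amazon'],
    'Email': ['apple@apple.com', 'ms@ms.com', 'amz@amz.com'],
})
```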
I also have another, big (2 GB) DataFrame of master data, likewise a `pandas.core.frame.DataFrame`:

```
code  company_name  product
123   MicroSoft     Windows
456   Apple         iphone
789   Amazon        cloud
```

The `company_name` column is also dtype `object`, the same as `Company`, not string.
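Continuing the sample above, a tiny stand-in for `master` built from the rows shown (the real frame has ~20M rows):

```python
# tiny stand-in for the 20M-row master frame, rows taken from the table above
master = pd.DataFrame({
    'code': [123, 456, 789],
    'company_name': ['MicroSoft', 'Apple', 'Amazon'],
    'product': ['Windows', 'iphone', 'cloud'],
})
```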
I want to look up every value of `Company` in the `company_name` column of the second DataFrame and append each match to a series:
```python
def finder(slave, master):
    finalList = []
    for company in slave['Company']:
        # placeholder for my real type check on the value
        if isinstance(company, str):
            for info in master['company_name']:
                if company in info:
                    # take the code of the row whose company_name matched
                    finalList.append(master.loc[master['company_name'] == info, 'code'].iloc[0])
                    break
        else:
            # value is not of the type I care about: keep it unchanged
            finalList.append(company)
    return finalList
```
How can the lookup be made much faster for this kind of situation? The master list is 20M rows and the current loop is extremely slow:

```python
slave['newCode'] = finder(slave, master)  # very slow
```
The result should be:

```
Date  ID  Name  Company    Email            newCode
2018  1   AB    Apple      apple@apple.com  456
2017  2   BC    MicroSoft  ms@ms.com        123
2016  3   CD    Amazon     amz@amz.com      789
```
I am checking the type of each value, so this is not a simple merge operation. The slave and master DataFrames do not have the same number of rows: `master` has 20M rows, whereas `slave` has only about 1K. I want to compare against the `company_name` column but retrieve the value from a different column, `code`, of the matching row and store it as a new column in the slave table.
My goal: how can I use NumPy instead, using vectorization with NumPy arrays?
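For illustration, this is roughly the kind of vectorized lookup I have in mind for the exact-match case; it is only a sketch, it ignores my type check and the substring matching in the loop above, and I am not sure it is the right NumPy-style approach:

```python
import numpy as np

# Sketch only: exact-match lookup of Company -> code via a lookup Series,
# ignoring the type check and the 'company in info' substring logic above.
lookup = master.drop_duplicates('company_name').set_index('company_name')['code']

matched = slave['Company'].map(lookup)           # NaN where there is no exact match
slave['newCode'] = np.where(matched.notna(), matched, slave['Company'])
```

Is something along these lines the way to go, and how would I adapt it to the type check and the partial (substring) matches, ideally with vectorized operations on NumPy arrays for speed?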