
I have a DataFrame of companies, called slave (<class 'pandas.core.frame.DataFrame'>):

Date    ID  Name    Company     Email
2018    1   AB      Apple       apple@apple.com
2017    2   BC      MicroSoft   ms@ms.com
2016    3   CD      Amazon      amz@amz.com

The Company column has dtype object, not string.

I also have another, big (2 GB) DataFrame of master data (<class 'pandas.core.frame.DataFrame'>):

code    company_name    product
123     MicroSoft       Windows
456     Apple           iphone
789     Amazon          cloud

The company_name column is also dtype object, the same as Company, not string.

I want to look up every value of Company in the company_name column of the second DataFrame and append the matching code to a list:

def finder(slave, master):
    final_list = []
    for company in slave['Company']:
        # only try to match string values; note that comparing type() to a
        # string literal (as my first attempt did) is always False
        if isinstance(company, str):
            for info in master['company_name']:
                if company in info:
                    # fetch the code of the row whose company_name matched
                    final_list.append(
                        master.loc[master['company_name'] == info, 'code'].iloc[0])
                    break
        else:
            # non-string values are passed through unchanged
            final_list.append(company)
    return final_list

How can the lookup process be made much faster in this situation? The master table is 20M rows and the current loop is extremely slow.

slave['newCode'] = finder(slave, master)  # very slow

the result will be:

Date    ID  Name    Company     Email               newCode
2018    1   AB      Apple       apple@apple.com     456
2017    2   BC      MicroSoft   ms@ms.com           123
2016    3   CD      Amazon      amz@amz.com         789

I am checking the type of each value, so it is not a simple merge operation. The slave and master DataFrames do not have the same number of rows: master has 20M rows, whereas slave has just 1K. I want to compare on the company_name column, but fetch the corresponding row value from the code column and store it as a new column in the slave table.
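For context, one loop-free approach (a sketch on small illustrative frames, assuming exact matches on company_name) is to build a hash index over master once and let pandas' Series.map do the lookup; unmatched rows keep the original Company value, mirroring the loop's else branch:

```python
import pandas as pd

# small sample frames standing in for the real slave/master data
slave = pd.DataFrame({
    'Company': ['Apple', 'MicroSoft', 'Amazon'],
})
master = pd.DataFrame({
    'code': [123, 456, 789],
    'company_name': ['MicroSoft', 'Apple', 'Amazon'],
})

# build a one-time hash lookup: company_name -> code
lookup = master.set_index('company_name')['code']

# map performs a vectorized hash lookup; names with no match become NaN,
# which fillna replaces with the original Company value
slave['newCode'] = slave['Company'].map(lookup).fillna(slave['Company'])
```

The index is built once, so each of the 1K slave rows costs a single hash lookup instead of a scan over 20M master rows.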

My goal: how can I do this with vectorized NumPy arrays instead?
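The lookup can be vectorized in pure NumPy by sorting master's names once and running a batched binary search. This is a sketch on small illustrative data, assuming exact (not substring) matches:

```python
import numpy as np
import pandas as pd

# small sample frames standing in for the real slave/master data
slave = pd.DataFrame({'Company': ['Apple', 'MicroSoft', 'Amazon']})
master = pd.DataFrame({'code': [123, 456, 789],
                       'company_name': ['MicroSoft', 'Apple', 'Amazon']})

names = master['company_name'].to_numpy()
codes = master['code'].to_numpy()

# sort once (O(n log n)); every lookup then becomes a binary search
order = np.argsort(names)
sorted_names = names[order]
sorted_codes = codes[order]

targets = slave['Company'].to_numpy()
pos = np.searchsorted(sorted_names, targets)
pos = np.clip(pos, 0, len(sorted_names) - 1)  # guard the upper edge
matched = sorted_names[pos] == targets

# matched rows get the code; unmatched rows keep the original value,
# mirroring the loop's else branch
result = targets.astype(object)  # astype copies, so slave is untouched
result[matched] = sorted_codes[pos][matched]
slave['newCode'] = result
```

With this shape, the per-row cost is O(log n) over the sorted master names rather than a linear scan, and everything runs inside NumPy rather than a Python loop.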

teebeetee
    `i am checking for the type of the value, it is not a simple merge operation` Can you explain this..From what I have read..it looks left merge two df on `company` column and rename `code` to `newCode` – Rahul Agarwal Dec 12 '18 at 08:27
  • How is the data stored on-disk? How do you load it? Why don't you use an in-memory Database like sqlite for example? – haggi Dec 12 '18 at 08:28
  • @haggi the data is loaded from csv files. my goal is to use numpy vectorization – teebeetee Dec 12 '18 at 08:29
  • @RahulAgarwal merging is too slow in this situation, how can i use numpy vectorization instead? – teebeetee Dec 12 '18 at 08:30
  • May have been answered here (https://stackoverflow.com/questions/50101772/join-two-large-files-by-column-in-python) plus splitting slave based on type(company), then performing the join on the matching half and append the other half afterwards – steffen Dec 12 '18 at 09:32
  • Your previous (deleted) question by Vaishali should have answered your question. – cs95 Dec 12 '18 at 09:59

0 Answers