I need to map thousands of product codes which are hand typed to actual product codes from the companies. The problem is that there are a ton of data entry errors while entering the product codes.
Now, I have to take all the hand typed product codes and map it to the actual product code given by the company. (P.S I have around 500,000 rows in my database and around 3000 actual product code list.)
Example: Suppose there is a product with the product code 123-A. In my database, I have 6 rows such as below:
- 123A
- A123
- 12300
- 123A
- 123-A
When I do a vlookup in excel, I am able to map only the last row which is the exact match.
So, the process I am following now is divided into 4 steps as below:
Do the vlookup for the exact match
Remove the '-' and '.'(If any) from both the actual product code and the Database product code list and then do a vlookup.
Remove all character and keep only numbers from both the actual product code and the Database product code list and then do a vlookup.
Sum all the digits in the product code from both the actual product code and the Database product code list and then do a vlookup.
I want to recreate the same process in Python or R, where I can feed the actual product code list and the Database product code list and get the output.
Any help would be greatly appreciated.