-4

I need to map thousands of product codes which are hand typed to actual product codes from the companies. The problem is that there are a ton of data entry errors while entering the product codes.

Now, I have to take all the hand typed product codes and map it to the actual product code given by the company. (P.S I have around 500,000 rows in my database and around 3000 actual product code list.)

Example: Suppose there is a product with the product code 123-A. In my database, I have 6 rows such as below:

  • 123A
  • A123
  • 12300
  • 123A
  • 123-A

When I do a vlookup in excel, I am able to map only the last row which is the exact match.

So, the process I am following now is divided into 4 steps as below:

  1. Do the vlookup for the exact match

  2. Remove the '-' and '.'(If any) from both the actual product code and the Database product code list and then do a vlookup.

  3. Remove all character and keep only numbers from both the actual product code and the Database product code list and then do a vlookup.

  4. Sum all the digits in the product code from both the actual product code and the Database product code list and then do a vlookup.

I want to recreate the same process in Python or R, where I can feed the actual product code list and the Database product code list and get the output.

Any help would be greatly appreciated.

  • 1
    This sounds a lot like: Please write some code for me, which is something you're likely to get downvotes for and/or get your question closed. Also, you didn't provide any sample data that could be used to formulate an answer. From experience, this will either result in no answers, or answers that require (endless) modifications, because it doesn't work for your particular case. See [How to make a minimal, reproducible question](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example-aka-mcve-minimal-complete-and-ver) – KenHBS Nov 28 '17 at 17:51

1 Answers1

0

In Excel, you can use 2 wildcard characters around the '123'.

"*"&123&"*"

Below is an image.

enter image description here

I tried a few things in base R; I couldn't get anything working.

ASH
  • 20,759
  • 19
  • 87
  • 200