I have a CSV document that maps each Commodity Name to a Commodity Category.
Ex:
Sl.No.  Commodity Category  Commodity Name
1       Stationary          Pencil
2       Stationary          Pen
3       Stationary          Marker
4       Office Utensils     Chair
5       Office Utensils     Drawer
6       Hardware            Monitor
7       Hardware            CPU
I also have a second CSV file that contains commodity names as they were actually entered, including misspellings and brand or model suffixes.
Ex:
Sl.No.  Commodity Name
1       Pancil
2       Pencil-HB 02
3       Pencil-Apsara
4       Pancil-Nataraj
5       Pen-Parker
6       Pen-Reynolds
7       Monitor-X001RL
The output I would like standardises each commodity name and assigns it to its Commodity Category, as shown below:
Sl.No.  Commodity Name  Commodity Category
1       Pencil          Stationary
2       Pencil          Stationary
3       Pencil          Stationary
4       Pencil          Stationary
5       Pen             Stationary
6       Pen             Stationary
7       Monitor         Hardware
Step 1) First, I have to clean the data using NLTK (text mining methods), so as to separate "Pencil" from "Pencil-HB 02".
Step 2) After cleaning, I have to use an approximate string matching technique, e.g. agrep(), to match patterns such as "Pencil *" and to correct "Pancil" to "Pencil".
Step 3) Once the names are corrected, I have to assign each one to its category. I have no idea how to do this.
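To make the three steps concrete, here is a minimal sketch of how they could fit together in Python. It is only an illustration under several assumptions: the standard-library difflib.get_close_matches is swapped in for agrep() (which is an R function), the commodity name is assumed to be the text before the first hyphen or space, the 0.6 similarity cutoff is arbitrary, and the helper names (clean, standardise, categorise, CATEGORY_BY_NAME) are hypothetical. The lookup table is hard-coded here; in practice it would be read from the first CSV file.

```python
import difflib
import re

# Reference table taken from the first CSV: commodity name -> category.
# Hard-coded for illustration; it would normally be loaded from the file.
CATEGORY_BY_NAME = {
    "Pencil": "Stationary",
    "Pen": "Stationary",
    "Marker": "Stationary",
    "Chair": "Office Utensils",
    "Drawer": "Office Utensils",
    "Monitor": "Hardware",
    "CPU": "Hardware",
}


def clean(raw_name):
    """Step 1: keep only the text before the first hyphen or whitespace.

    Assumption: the brand/model suffix always follows the commodity name.
    """
    return re.split(r"[-\s]", raw_name.strip(), maxsplit=1)[0]


def standardise(raw_name, cutoff=0.6):
    """Step 2: map the cleaned token to the closest known name, or None.

    difflib matching is case-sensitive; the cutoff is an assumed value.
    """
    token = clean(raw_name)
    matches = difflib.get_close_matches(
        token, list(CATEGORY_BY_NAME), n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


def categorise(raw_name):
    """Step 3: look up the category of the standardised name."""
    name = standardise(raw_name)
    return name, CATEGORY_BY_NAME.get(name)


if __name__ == "__main__":
    for raw in ["Pancil", "Pencil-HB 02", "Pen-Parker", "Monitor-X001RL"]:
        print(raw, "->", categorise(raw))
```

Step 3 then reduces to a dictionary lookup once the name has been standardised. Names with no sufficiently close match come back as (None, None), so they can be reviewed by hand rather than being forced into a category.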
This is the approach I have in mind. I started with step 2 and I am stuck there: I cannot find a concrete way to code it. Is there a way to get the required output? If so, please suggest a method I can proceed with.