I have a fairly general question about the process of working with text data. My goal is to create UNIQUE short labels/descriptions for products from existing long descriptions, based on specific rules.
In practice it looks like this: I get the data shown in the Existing_Long_Description column and, using rules and loops in Python, I change it into the data in the New_Label column.
Existing_Long_Description | New_Label |
---|---|
Edge protector clamping range 1-2 mm length 10 m width 6.5 mm height 9.5 mm blac | Edge protector BLACK RNG 1-2MM L=10M |
Edge protector clamping range 1-2 mm length 10 m width 6.5 mm height 9.5 mm red | Edge protector RED RNG 1-2MM L=10M |
Shortening to the desired format is not a problem. The problem starts when I check the uniqueness of the New_Label column, because the shortening can create duplicates:
Existing_Long_Description | New_Label |
---|---|
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1 | Draw-in collet chuck dm 1-10MM |
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6 | Draw-in collet chuck dm 1-10MM |
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=8 | Draw-in collet chuck dm 1-10MM |
To solve this, I need to add some distinguishing factor to my New_Label column, based on the differences in Existing_Long_Description.
The problem is that duplicates do not necessarily come in pairs: a group of identical labels can span an unknown number of articles. I thought about the following process:
- Identify the duplicates in Existing_Long_Description: if there are duplicates there, I know those cannot be solved in New_Label
- Identify the duplicates in the New_Label column: if they are not in the selection above, I know these can be solved
- For those that can be solved, run some kind of distinguisher to find where they differ and extract that difference into another column, so I can decide later what to add to the New_Label column
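To make the three steps concrete, here is a minimal sketch using only the standard library. It assumes the data is available as `(Existing_Long_Description, New_Label)` tuples; the function names `classify` and `differing_tokens` are hypothetical, and splitting on whitespace is an assumption about what counts as a "difference". It is an illustration of the idea, not a finished solution.

```python
from collections import defaultdict

# Sample rows taken from the tables above: (Existing_Long_Description, New_Label)
rows = [
    ("Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1",
     "Draw-in collet chuck dm 1-10MM"),
    ("Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6",
     "Draw-in collet chuck dm 1-10MM"),
    ("Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=8",
     "Draw-in collet chuck dm 1-10MM"),
    ("Edge protector clamping range 1-2 mm length 10 m width 6.5 mm height 9.5 mm red",
     "Edge protector RED RNG 1-2MM L=10M"),
]


def classify(rows):
    """Split duplicated labels into solvable and unsolvable groups.

    A group is treated as unsolvable if it contains at least one repeated
    long description (steps 1 and 2 of the list above).
    """
    groups = defaultdict(list)
    for long_desc, label in rows:
        groups[label].append(long_desc)

    solvable, unsolvable = {}, {}
    for label, descs in groups.items():
        if len(descs) < 2:
            continue  # label is already unique
        if len(set(descs)) == len(descs):
            solvable[label] = descs
        else:
            unsolvable[label] = descs
    return solvable, unsolvable


def differing_tokens(descs):
    """For each description, keep the tokens not shared by the whole group (step 3)."""
    token_lists = [d.split() for d in descs]
    common = set(token_lists[0]).intersection(*token_lists[1:])
    return [[t for t in tokens if t not in common] for tokens in token_lists]


solvable, unsolvable = classify(rows)
for label, descs in solvable.items():
    for desc, diff in zip(descs, differing_tokens(descs)):
        print(label, "->", diff)
```

Because each group is handled as a whole, this works for any number of articles sharing a label, not only for pairs. If two descriptions in a group differ only in token order, the distinguishing list comes back empty, so that case would need separate handling.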
Does what I want to do make sense? As I am doing this for the first time, I am wondering: is there a way of working that you would recommend?
I have read some articles, such as Find the similarity metric between two strings, and elsewhere on Stack Overflow I read about difflib: https://docs.python.org/3/library/difflib.html. I am planning to use it, but it still feels rather inefficient to me, so maybe someone here can help.
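For reference, this is roughly how I imagine using difflib: a small sketch that compares two descriptions token by token with `difflib.SequenceMatcher` and collects the parts that are not equal. The helper name `token_diff` is hypothetical, and comparing whitespace-separated tokens rather than characters is my own assumption.

```python
from difflib import SequenceMatcher


def token_diff(a, b):
    """Return the tokens that appear only in a and only in b, in order."""
    tokens_a, tokens_b = a.split(), b.split()
    # autojunk=False disables the popularity heuristic, which only matters for long sequences
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    only_a, only_b = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            only_a.extend(tokens_a[i1:i2])
            only_b.extend(tokens_b[j1:j2])
    return only_a, only_b


first = "Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1"
second = "Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6"
print(token_diff(first, second))  # (['L=1'], ['L=6'])
```

The part that worries me is that this compares two strings at a time, so a group of n articles with the same label needs n(n-1)/2 comparisons.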
Thanks!