Getting rid of duplicates in text strings in new column by identifying differences in original data and using this difference in new column

Question

I have a fairly general question about the process of working with text data. My goal is to create UNIQUE short labels/descriptions for products from existing long descriptions, based on specific rules.

In practice it looks like this: I get the data shown in the Existing_Long_Description column and, using rules and loops in Python, turn it into the data in the New_Label column.

Existing_Long_Description	New_Label
Edge protector clamping range 1-2 mm length 10 m width 6.5 mm height 9.5 mm blac	Edge protector BLACK RNG 1-2MM L=10M
Edge protector clamping range 1-2 mm length 10 m width 6.5 mm height 9.5 mm red	Edge protector RED RNG 1-2MM L=10M

Shortening to the desired format is not a problem. The problem starts when I check the uniqueness of the New_Label column, because the shortening can create duplicates:

Existing_Long_Description	New_Label
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1	Draw-in collet chuck dm 1-10MM
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6	Draw-in collet chuck dm 1-10MM
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=8	Draw-in collet chuck dm 1-10MM

To solve this I need to add some distinguishing factor to the New_Label column, based on the differences in Existing_Long_Description.

The problem is that a collision may involve an unknown number of articles. I thought about the following process:

1. Identify the duplicates in Existing_Long_Description. If there are duplicates there, I know they can't be resolved in New_Label.
2. Identify the duplicates in the New_Label column. If they are not in the selection above, I know they can be resolved.
3. For those that can be resolved, run some distinguisher to find where the long descriptions differ, and extract that difference into another column so I can decide later what to add to New_Label.

Does what I want to do make sense? As I am doing this for the first time, I am wondering whether there is a way of working you would recommend.

I read some articles, such as "Find the similarity metric between two strings", and elsewhere on Stack Overflow I read about difflib: https://docs.python.org/3/library/difflib.html. I am planning to use it, but it still feels rather inefficient to me, so maybe someone here can help.

Thanks!

score 0 · Answer 1 · answered Jan 08 '22 at 20:12

A relational database would be a good fit for this problem, with appropriate UNIQUE indexes configured. But let's assume you're going to solve it in memory, rather than on disk. Assume that get_longs() will read long descriptions from your data source.
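For completeness, here is a minimal sketch of what the database route could look like, using Python's built-in sqlite3 module with an in-memory database. The table name product and the column names long_desc and short_label are assumptions made up for illustration; the two rows are taken from the question.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    " long_desc TEXT NOT NULL UNIQUE,"
    " short_label TEXT NOT NULL UNIQUE)"
)

rows = [
    ("Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1",
     "Draw-in collet chuck dm 1-10MM"),
    ("Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6",
     "Draw-in collet chuck dm 1-10MM"),
]

rejected = []
for long_desc, short_label in rows:
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO product VALUES (?, ?)", (long_desc, short_label))
    except sqlite3.IntegrityError:
        # The UNIQUE index refused a label that is already taken.
        rejected.append(long_desc)

stored = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]
print(stored, "stored,", len(rejected), "rejected")
```

The UNIQUE constraint only detects the collision; you would still need logic like the in-memory approach below to pick a new label for each rejected row.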

dup long descriptions

Avoid processing like this:

longs = []
for long in get_longs():
    if long not in longs:
        longs.append(long)

Why?

It is quadratic, running in O(N^2) time for N descriptions. Each "in" test on a list takes linear O(N) time, and we perform N such tests. Processing 1000 parts could take on the order of a million comparisons.

Instead, take care to use an appropriate data structure, a set:

longs = set(get_longs())

That's enough to quickly de-dup the long descriptions, in linear time.
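One caveat: a set does not preserve the order of its input. If order matters to you, a sketch like the following keeps the first occurrence of each description and still runs in linear time. It relies on dict keys keeping insertion order (Python 3.7+). The get_longs() below is only a hypothetical stand-in that returns a hard-coded list.

```python
def get_longs():
    """Hypothetical stand-in for reading long descriptions from the data source."""
    return [
        "Edge protector clamping range 1-2 mm length 10 m red",
        "Edge protector clamping range 1-2 mm length 10 m black",
        "Edge protector clamping range 1-2 mm length 10 m red",  # exact duplicate
    ]


# dict.fromkeys() drops repeated keys but remembers first-seen order.
unique_longs = list(dict.fromkeys(get_longs()))

for desc in unique_longs:
    print(desc)
```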

dup short descriptions

Now the fun begins. You explained that you already have a function that works like a champ. But we must adjust its output in the case of collisions.

class Dedup:

    def __init__(self):
        self.short_to_long = {}

    def get_shorts(self):
        """Produces unique short descriptions."""
        for long in sorted(set(get_longs())):
            short = summary(long)
            orig_long = self.short_to_long.get(short)
            if orig_long:
                short = self.deconflict(short, orig_long, long)
            self.short_to_long[short] = long
            yield short

    def deconflict(self, short, orig_long, long):
        """Produces a novel short description that won't conflict with existing ones."""
        for word in sorted(set(long.split()) - set(orig_long.split())):
            short += f' {word}'
            if short not in self.short_to_long:  # Yay, we win!
                return short
        # Boo, we lose.
        raise ValueError(f"Sorry, can't find a good description: {short}\n{orig_long}\n{long}")

The expression that subtracts one set from another is answering the question, "What words in long would help me to uniqueify this result?" Now of course, some of them may have already been used by other short descriptions, so we take care to check for that.
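To make the set subtraction concrete, here is a small illustration using two of the long descriptions from the question. The variable names are chosen for this example only.

```python
orig_long = "Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1"
long = "Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6"

# Words that occur in the newer description but not in the one already stored.
candidates = sorted(set(long.split()) - set(orig_long.split()))
print(candidates)  # ['L=6']

# The subtraction is not symmetric: swapping the operands yields the other side.
reverse = sorted(set(orig_long.split()) - set(long.split()))
print(reverse)  # ['L=1']
```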

Given several long descriptions that collide in the way you're concerned about, the 1st one will have the shortest description, and ones appearing later will tend to have longer "short" descriptions.
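Here is a self-contained sketch of that behaviour on the three collet chuck rows from the question. It restates the same logic as a plain function so it can be run on its own. The summary() below is a made-up stand-in (it just keeps the first four words), not your real shortening rules, and unique_shorts() is a name invented for this example.

```python
def summary(long_desc):
    """Hypothetical shortening rule: keep only the first four words."""
    return " ".join(long_desc.split()[:4])


def unique_shorts(longs):
    """Return a dict mapping each distinct long description to a unique short label."""
    short_to_long = {}
    for long_desc in sorted(set(longs)):
        short = summary(long_desc)
        orig_long = short_to_long.get(short)
        if orig_long is not None:
            # Collision: append words found only in the newer long description
            # until the label is no longer taken.
            for word in sorted(set(long_desc.split()) - set(orig_long.split())):
                short += f" {word}"
                if short not in short_to_long:
                    break
            else:
                # No distinguishing words were available, or all were used up.
                raise ValueError(f"cannot make a unique label for: {long_desc}")
        short_to_long[short] = long_desc
    return {long_desc: short for short, long_desc in short_to_long.items()}


base = "Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting "
longs = [base + "L=1", base + "L=6", base + "L=8"]

labels = unique_shorts(longs)
for long_desc in longs:
    print(labels[long_desc])
```

With this stand-in, the L=1 row keeps the bare label, while the L=6 and L=8 rows each get their distinguishing word appended.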

The approach above is a bit simplistic, but it should get you started. It does not, for example, distinguish between "claw hammer" and "hammer claw". Both strings survive the initial uniqueification, but if their short descriptions collide, the set difference is empty, so there are no words left to help with deconflicting and deconflict() raises ValueError. For your use case the approach above is likely to be "good enough".
