
I have some weird raw data that contains multiple names written in different ways and lengths. Something like:

data = [
    'apple',
    'apple;apple(big)',
    'apple(apple),apple',
    'banana(banana)',
    'banana',
    nan,  # yes, there are some nan values
    'cookie;cookie(cookie)',
    'cookie(choco)']

The desired output is the shortest valid name for each item; in the demo case, output = ['apple', 'banana', 'cookie'].

The way I thought about it is declaring output = [] and iterating through data: if an element does not already exist in output, append it; if it does, check whether the two are similar and keep the one with the smaller length. But this seems very inefficient, and I don't know how to compare the values and pick the smallest valid one.
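Roughly, I imagined something like the sketch below, where "similar" is just a substring test (my guess at what the similarity check should be, so it may well be wrong) and non-string entries such as the nan are skipped:

output = []
for item in data:
    if not isinstance(item, str):  # skip the nan entries
        continue
    for i, existing in enumerate(output):
        # "similar" here just means one string contains the other
        if item in existing or existing in item:
            output[i] = min(item, existing, key=len)  # keep the shorter one
            break
    else:
        output.append(item)

On the demo data this still leaves combined entries like 'cookie;cookie(cookie)' and 'cookie(choco)' unmatched, which is exactly where I'm stuck.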

I tried regex too, but it failed since the valid name can appear at any position within an entry. How do I complete this task?

Cookie
  • The code you have posted is not valid Python code. Post **valid** code and ask your question. – balderman Sep 23 '21 at 08:06
  • Explain how you acquired this data. The *nan* element makes no sense; maybe it's None. We'd also be very interested to see your RE approach. – Sep 23 '21 at 08:06

2 Answers


It looks like values between parentheses are never valid, so you can simply discard them (and, as stated in the comments, you have to convert the nans into None for the snippet to be valid Python code):

import re

# split each entry on ',' or ';', skip the None/empty entries, and drop any piece containing '('
data = set(y for x in data if x for y in re.split('[,;]', x) if '(' not in y)
print(data)  # {'apple', 'cookie', 'banana'}
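For readability, the same logic can be written as a plain loop (an equivalent sketch, again assuming the nans have been replaced with None and starting from the original list):

import re

names = set()
for entry in data:
    if not entry:  # skips the None values that replaced the nans
        continue
    for part in re.split('[,;]', entry):
        if '(' not in part:  # pieces containing parentheses are never valid
            names.add(part)
print(names)  # {'apple', 'banana', 'cookie'}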
diggusbickus
  • Yes, this works. Thank you. The nan data appears because I scraped it from elements of a dataframe. – Cookie Sep 23 '21 at 08:34
  • @Cookie Please add such information to your question. It might seem irrelevant to you, but it could help us find the best solution. – hc_dev Sep 23 '21 at 18:12

Assumption: data from pandas' DataFrame or Series

Stumbling over the nan, a comment revealed your data source as "elements of a dataframe". So I assume you can use pandas, too.

Steps

  1. remove NaN values and reindex (pandas only)
  2. clean the data: e.g. 'cookie;cookie(cookie)' becomes 'cookie,cookie'
  3. split by comma and explode (add as rows): e.g. 'cookie,cookie' becomes separate rows ['cookie', 'cookie']
  4. reduce to the unique values

Solution integrated into pandas

For simplicity, the data is defined as a Series (a 1-dimensional, list-like structure).

import pandas as pd
import numpy as np  # to inject NaN values

s = pd.Series([
'apple',
'apple;apple(big)',
'apple(apple),apple',
'banana(banana)',
'banana',
np.nan,  # fixed the syntax
'cookie;cookie(cookie)',
'cookie(choco)'
])

unique = (
    s.dropna()  # remove the NaN values
    .reset_index(drop=True)  # adjust the index as if NaN never existed
    .str.replace(r'\([^)]*\)?', '', regex=True)  # remove parenthesized parts, incl. half-open ones like '(big'
    .str.replace(r'[;.]', ',', regex=True)  # replace semicolon or period by comma
    .str.split(',').explode()  # split to rows (add possible duplicate elements)
    .unique()  # reduce to unique
)

print(type(unique))
print(unique)

Prints:

<class 'numpy.ndarray'>
['apple' 'banana' 'cookie']
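If you want a plain Python list like the output = [...] from the question rather than a NumPy array, one extra step (a small addition on top of the answer above):

output = unique.tolist()  # convert the NumPy array to a plain Python list
print(output)  # ['apple', 'banana', 'cookie']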


hc_dev