
I have some weird raw data that contains multiple names written in different ways and lengths. Something like:

data = [
    'apple',
    'apple;apple(big)',
    'apple(apple),apple',
    'banana(banana)',
    'banana',
    nan,  # yes, there are some nan values
    'cookie;cookie(cookie)',
    'cookie(choco)']

The desired output is the shortest valid name for each item; in the demo case, output = ['apple', 'banana', 'cookie'].

The way I thought about it is declaring output = [] and iterating through data: if an element does not already exist in output, append it; if it does, check whether the two are similar and keep the one with the smaller length. But this seems very inefficient, and I don't know how to compare the values and pick the smallest valid one.
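Roughly, I imagined something like the sketch below, where "similar" is just a substring test (my guess at what the similarity check should be, so it may well be wrong) and non-string entries such as the nan are skipped:

output = []
for item in data:
    if not isinstance(item, str):  # skip the nan entries
        continue
    for i, existing in enumerate(output):
        # "similar" here just means one string contains the other
        if item in existing or existing in item:
            output[i] = min(item, existing, key=len)  # keep the shorter one
            break
    else:
        output.append(item)

On the demo data this still leaves combined entries like 'cookie;cookie(cookie)' and 'cookie(choco)' unmatched, which is exactly where I'm stuck.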

I tried regex too, but it failed since the valid name can appear at any position within an entry. How do I complete this task?

Cookie
  • The code you have posted is not valid Python code. Post **valid** code and ask your question. – balderman Sep 23 '21 at 08:06
  • Explain how you acquired this data. The *nan* element makes no sense; maybe it's None. We'd also be very interested to see your RE approach. – Sep 23 '21 at 08:06

2 Answers


It looks like values between parentheses are never valid, so you can simply discard them (and, as stated in the comments, you have to convert the nans into None for the snippet to be valid Python code):

import re

# split each entry on ',' or ';', skip the None/empty entries, and drop any piece containing '('
data = set(y for x in data if x for y in re.split('[,;]', x) if '(' not in y)
print(data)  # {'apple', 'cookie', 'banana'}
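For readability, the same logic can be written as a plain loop (an equivalent sketch, again assuming the nans have been replaced with None and starting from the original list):

import re

names = set()
for entry in data:
    if not entry:  # skips the None values that replaced the nans
        continue
    for part in re.split('[,;]', entry):
        if '(' not in part:  # pieces containing parentheses are never valid
            names.add(part)
print(names)  # {'apple', 'banana', 'cookie'}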
diggusbickus
  • Yes, this works. Thank you. The nan data appears because I scraped it from elements of a dataframe. – Cookie Sep 23 '21 at 08:34
  • @Cookie Please add such information to your question. It might seem irrelevant to you, but it could help us find the best solution. – hc_dev Sep 23 '21 at 18:12

Assumption: data from pandas' DataFrame or Series

Stumbling over the nan, a comment revealed your data source as "elements of a dataframe". So I assume you can use pandas, too.

Steps

  1. remove NaN values and reindex (pandas only)
  2. clean the data: e.g. 'cookie;cookie(cookie)' becomes 'cookie,cookie'
  3. split by comma and explode (add as rows): e.g. 'cookie,cookie' becomes separate rows ['cookie', 'cookie']
  4. reduce to the unique values

Solution integrated into pandas

For simplicity, the data is defined as a Series (a 1-dimensional, list-like structure).

import pandas as pd
import numpy as np  # to inject NaN values

s = pd.Series([
'apple',
'apple;apple(big)',
'apple(apple),apple',
'banana(banana)',
'banana',
np.nan,  # fixed the syntax
'cookie;cookie(cookie)',
'cookie(choco)'
])

unique = (
    s.dropna()  # remove the NaN values
    .reset_index(drop=True)  # adjust the index as if NaN never existed
    .str.replace(r'\([^)]*\)?', '', regex=True)  # remove parenthesized parts, incl. half-open ones like '(big'
    .str.replace(r'[;.]', ',', regex=True)  # replace semicolon or period by comma
    .str.split(',').explode()  # split to rows (add possible duplicate elements)
    .unique()  # reduce to unique
)

print(type(unique))
print(unique)

Prints:

<class 'numpy.ndarray'>
['apple' 'banana' 'cookie']
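If you want a plain Python list like the output = [...] from the question rather than a NumPy array, one extra step (a small addition on top of the answer above):

output = unique.tolist()  # convert the NumPy array to a plain Python list
print(output)  # ['apple', 'banana', 'cookie']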


hc_dev