
I have the following dataframe in pandas:

     id              name categoryids    shops
5   239         Boulanger         [5]      152
3   196  Bouygues Telecom         [5]      500
4   122             Darty       [5,3]      363
1   311     Electro Dépôt         [5]       81
0  2336            Orange        [15]      578
2   194            Orange         [5]      577

I would like to drop the 5th row because its name is duplicated but it has a different value in the categoryids column. However, since the values are arrays (they can hold more than one value), I have trouble comparing them.

My idea was to compute the mode of this column and discard all rows whose array doesn't contain that value (in this case the mode would be 5, so the 5th row should be discarded because 5 is not present in its array), but I have problems calculating the mode because each entry is an array rather than a single value.

Any ideas or suggestions on how to do this?

I'm using Python 3.7 and the latest version of pandas.

Thank you.

jcf
  • So, you'll keep rows having the same `name` and `categoryids`? – vahdet Apr 26 '19 at 09:49
  • Can you create a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve)? What happens if there are multiple values in the array and a duplicated `name`? – jezrael Apr 26 '19 at 09:55
  • @vahdet: I want to keep all rows with different names, and if there is a duplicated _name_, check _categoryids_ and drop the row that doesn't have the mode value (the most repeated one across all rows). @jezrael: I don't have any code yet. If there are multiple values in the array and a duplicated name, I would check whether the mode value is in the array and discard the row if it isn't. If it is, I'd have to handle that case; I would probably combine the rows in some way or discard one at random, but I don't have that case prepared yet. – jcf Apr 26 '19 at 10:12

4 Answers


First we can mark which rows in the name column are duplicates.

Then we can unnest your arrays in categoryids with the function found in this answer.

Finally we filter on which rows are not marked as duplicate or are equal to the mode:

import numpy as np
import pandas as pd

def unnest(df, tile, explode):
    # Series of lists to be exploded (summing across the columns in `explode`)
    vals = df[explode].sum(1)
    # How many times to repeat each row: the length of its list
    rs = [len(r) for r in vals]
    # Repeat the "tile" columns and line them up with the flattened list values
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns=tile + ['_'.join(explode)])

# Mark duplicate rows
df['dups'] = df.name.duplicated(keep=False).astype(int)
# Unnest categoryids column
df2 = unnest(df, ['id', 'name', 'shops', 'dups'], ['categoryids'])

print(df2)
     id              name shops dups categoryids
0   239         Boulanger   152    0           5
1   196  Bouygues Telecom   500    0           5
2   122             Darty   363    0           5
3   122             Darty   363    0           3
4   311     Electro Dépôt    81    0           5
5  2336            Orange   578    1          15
6   194            Orange   577    1           5

Filter the duplicate rows which are not equal to the mode:

mode = df2['categoryids'].mode()

df2 = df2[~df2['dups'].eq(1) | df2['categoryids'].isin(mode)].drop('dups', axis=1)

print(df2)
    id              name shops categoryids
0  239         Boulanger   152           5
1  196  Bouygues Telecom   500           5
2  122             Darty   363           5
3  122             Darty   363           3
4  311     Electro Dépôt    81           5
6  194            Orange   577           5

Optionally, we can groupby on name to get the arrays back:


df2 = df2.groupby('name').agg({'id': 'first',
                               'shops': 'first',
                               'categoryids': list}).reset_index()

print(df2)
               name   id  shops categoryids
0         Boulanger  239    152         [5]
1  Bouygues Telecom  196    500         [5]
2             Darty  122    363      [5, 3]
3     Electro Dépôt  311     81         [5]
4            Orange  194    577         [5]
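
As an aside, on pandas 0.25 or newer the custom unnest helper could likely be replaced by the built-in DataFrame.explode. A minimal sketch of the same steps, starting again from the original df (assuming that pandas version is available):

# Sketch assuming pandas >= 0.25, where DataFrame.explode exists
df2 = (df.assign(dups=df['name'].duplicated(keep=False).astype(int))
         .explode('categoryids'))          # one row per value in each list

mode = df2['categoryids'].mode()
df2 = df2[~df2['dups'].eq(1) | df2['categoryids'].isin(mode)].drop('dups', axis=1)
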
Erfan

With a DataFrame like this:

import pandas as pd

df = pd.DataFrame({'id': [239, 196, 122, 311, 2336, 194],
                   'name': ['Boulanger', 'Bouygues Telecom', 'Darty',
                            'Electro Dépôt', 'Orange', 'Orange'],
                   'shops': [152, 500, 363, 81, 578, 577],
                   'categoryids': [[5], [5], [5, 3], [5], [15], [5]]})

You can do:

df.sort_values('categoryids').drop_duplicates('name', keep='first')

This sorts by the categoryids column, then drops the duplicates in name, keeping the first occurrence.
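
A minimal variant of the same idea, sorting on an explicit helper column instead of relying on Python's element-wise list comparison (the value 5 is assumed to be the mode here, as in the question):

mode_value = 5  # assumed most common category id
result = (df.assign(no_mode=df['categoryids'].apply(lambda ids: mode_value not in ids))
            .sort_values('no_mode')             # rows containing the mode sort first
            .drop_duplicates('name', keep='first')
            .drop(columns='no_mode'))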

EDIT:

Another thing you can do is check whether the value you are after exists in the categoryids column:

df["exist"] = [int(5 in r)  for r in df["categoryids"]]

Which will give you:

    id              name                shops   categoryids exist
0   239             Boulanger             152            [5]    1
1   196             Bouygues Telecom      500            [5]    1
2   122             Darty                 363         [5, 3]    1
3   311             Electro Dépôt          81            [5]    1
4   2336            Orange                578           [15]    0
5   194             Orange                577            [5]    1

And then you can take only the ones that exist:

df[df['exist'] == 1]

Then mark the duplicates using duplicated() as @Erfan mentioned:

df['dups'] = df['name'].duplicated(keep=False).astype(int)


     id              name  shops categoryids  exist  dups
0   239         Boulanger    152         [5]      1     0
1   196  Bouygues Telecom    500         [5]      1     0
2   122             Darty    363      [5, 3]      1     0
3   311     Electro Dépôt     81         [5]      1     0
4  2336            Orange    578        [15]      0     1
5   194            Orange    577         [5]      1     1


df[(
    (df['dups']!=1) | 
    (df['exist']!=0)
)].drop(['exist', 'dups'], axis=1).reset_index()

Would result in:

index   id  name               shops    categoryids
0   0   239 Boulanger            152    [5]
1   1   196 Bouygues Telecom     500    [5]
2   2   122 Darty                363    [5, 3]
3   3   311 Electro Dépôt         81    [5]
4   5   194 Orange               577    [5]
VnC
  • Does this take into account rows that are **not** a duplicate but also not equal to the mode? – Erfan Apr 26 '19 at 10:37
  • I forgot to say: in that case no action would be required (a priori; maybe I'll have to check whether another action needs to be performed). If a row does not have a duplicate name but its _categoryids_ is not equal to the mode, it should be preserved, as I only use the _categoryids_ field to differentiate between rows with equal names. – jcf Apr 26 '19 at 10:57

You can try:

df = df.drop_duplicates(subset = ['name'])

This looks for duplicates only in the name column. You can combine columns by adding other column names to the subset list.
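
For example, a hypothetical combination that treats rows as duplicates only when both name and shops match:

df = df.drop_duplicates(subset=['name', 'shops'])

Note that a list-valued column such as categoryids can't be used in subset, because lists are unhashable.
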

Jorge
  • This will not work since it won't take into account which value is the mode in categoryids – Erfan Apr 26 '19 at 10:13

I managed to do it using @VnC's answer, with some modifications: I thought the categoryids values were actual arrays of integers (as in the example above), but I found that they were strings (not arrays of strings, but plain strings):

from collections import Counter

retailersIds_df = get_dataframe()  # external method to get the dataframe, not relevant
# The column holds plain strings like "[5,3]", so strip the brackets and split on commas
retailersIds_df['categoryids'] = retailersIds_df['categoryids'].str.replace('[', '', regex=False)
retailersIds_df['categoryids'] = retailersIds_df['categoryids'].str.replace(']', '', regex=False)
retailersIds_df['categoryids'] = retailersIds_df['categoryids'].str.split(',')

# The following lines calculate the mode of all the values contained in the arrays.
ids_aux = []
for row in retailersIds_df.itertuples():
    ids_aux = ids_aux + row.categoryids
mydict = Counter(ids_aux)
mode = [key for key, value in mydict.items() if value == max(mydict.values())][0]
# Counter returns a dict, and the key (the actual value) of the most repeated value is chosen.
# The [0] is for the case where two keys have the same count; the first one is chosen (arbitrarily).

retailersIds_df["exist"] = [int(mode in r) for r in retailersIds_df["categoryids"]]
retailersIds_df = retailersIds_df[retailersIds_df['exist'] == 1]

The loop is there to calculate the mode; a better option probably exists (I know loops over a pandas DataFrame should be avoided, but I couldn't think of another option, considering the arrays can be of arbitrary length).
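
For reference, a rough sketch of a loop-free version of the mode calculation, assuming pandas 0.25+ (for Series.explode) and that categoryids already holds the lists of strings produced above:

# Rough sketch: vectorized mode over all category ids (assumes pandas >= 0.25)
all_ids = retailersIds_df['categoryids'].explode()
mode = all_ids.mode().iloc[0]  # most frequent id; ties resolved by sort order
retailersIds_df = retailersIds_df[
    retailersIds_df['categoryids'].apply(lambda ids: mode in ids)
]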

jcf