
I can't seem to find a question/answer for what I am looking for and it may be that I am just not asking the question correctly. Any help would be very much appreciated.

I have a pandas DataFrame in which each row holds a list of values. I want to keep only one row per combination of values; the order of the values within a row does not matter:

   ind   col0   
    1    [11908513152, 11646250552]    
    2    [11885390452, 15535908250]    
    3    [11505181152, 16840777350]   
    4    [10939963252, 21451188650]   
    5    [11794522952, 71374807803]  
    6    [11545148452, 19354003650]  
    7    [11849104552, 12114525052]  
    8    [15535681750, 11832504652]    
    9    [13120602349, 11281922352, 17273945153]   
    10   [11281922352, 17273945153, 13120602349]   
    11   [11646250552, 11908513152]    
    ... 

Row 10 has the same values as row 9, just in a different order, so I only want to keep one of them. The same goes for rows 1 and 11.

ashley
  • Can you put the code that generates this dataframe in the question? Is that a pd.Series or a pd.DataFrame with one column? Is that a string with commas or is it a list? – Scott Boston Dec 18 '19 at 15:58
  • @ScottBoston I think it's a pd.DataFrame - the one column contains the list of values as a series. ```df_from_each_file = (pd.read_csv(f, encoding='latin1') for f in all_files) concatenated_df = pd.concat(df_from_each_file, ignore_index=True, sort=False) matched_df.drop(columns=['col1', 'col2'], inplace=True) ``` – ashley Dec 18 '19 at 17:08
  • Can you share the data format in the file? – AMC Dec 18 '19 at 17:50

2 Answers

0

What I would do is split + explode, then use duplicated:

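# Assumes col0 holds comma-separated strings; Series.explode needs pandas >= 0.25
s = df.col0
# Keep rows containing at least one value that has not appeared in an earlier row
yourdf = df[df.index.isin(s.str.split(', ').explode().duplicated().loc[lambda x: ~x].index)]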
                                     xxxxx
0               11908513152, 11646250552  
1               11885390452, 15535908250  
2               11505181152, 16840777350  
3               10939963252, 21451188650  
4               11794522952, 71374807803  
5               11545148452, 19354003650  
6               11849104552, 12114525052  
7               15535681750, 11832504652  
8  13120602349, 11281922352, 17273945153  
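
For reference, here is a minimal sketch of the split + explode idea broken into steps. It assumes col0 holds comma-separated strings (the question does not confirm this) and pandas 0.25 or later, where Series.explode was added. The sample rows are copied from the question; the names exploded, is_repeat, keep and result are only illustrative.

```python
import pandas as pd

# Assumption: col0 holds comma-separated strings, not Python lists
df = pd.DataFrame({'col0': [
    '11908513152, 11646250552',
    '13120602349, 11281922352, 17273945153',
    '11281922352, 17273945153, 13120602349',
    '11646250552, 11908513152',
]})

# One value per row; the index still points back to the original row
exploded = df.col0.str.split(', ').explode()

# True for a value that already appeared at an earlier position
is_repeat = exploded.duplicated()

# Keep a row if at least one of its values is new
keep = df.index.isin(is_repeat[~is_repeat].index)
result = df[keep]
```

Note that this compares individual values rather than whole combinations: a row made up only of values that each appeared in earlier rows would be dropped even if that exact combination is new.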
BENY
  • Sorry, I am not an expert and I am getting `NameError: name 's' is not defined`. I know I have to loop through each of the rows in the Series, but I don't know how to at the moment. I'll try to work it out. – ashley Dec 18 '19 at 17:19
  • Thank you for the update. I am now getting a further error: `AttributeError: 'Series' object has no attribute 'explode'` – ashley Dec 18 '19 at 17:45
  • @ashley might you have forgotten the `()` in `explode()`? – AMC Dec 18 '19 at 17:51
  • Please see the image I just added above. I think I am getting it right, is there a typo I can't see? – ashley Dec 18 '19 at 17:56
  • Please help get me there - it's so close - the output you have is exactly what I am looking for. But when I run this on the dataset it only gives me the first entry of the dataset. – ashley Dec 19 '19 at 14:29
0

I wasn't able to get @YOandBEN_W's answer to work, although I am very grateful for the help. A big shout-out to one of my friends (https://stackoverflow.com/users/12567056/ishan-patel), who sent me this:

import pandas as pd

my_data = {'col0':[ [0, 1.5], [2, 3], [1.5, 0]]}
df = pd.DataFrame(my_data)
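# Convert each list to a frozenset so that the order of the values is ignored
out = df.col0.apply(frozenset)
# Returns the unique sets; the result is not assigned back to df here
out.drop_duplicates()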
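
To keep the original rows rather than the frozensets, the same idea can be used as a filter on the DataFrame. This is a minimal sketch on the small example frame above; the names key and deduped are only illustrative.

```python
import pandas as pd

# Small illustrative frame, same values as the example above
df = pd.DataFrame({'col0': [[0, 1.5], [2, 3], [1.5, 0]]})

# frozenset ignores order and is hashable, so duplicated() can compare rows
key = df.col0.apply(frozenset)

# Keep the first row of each combination, with the original lists intact
deduped = df[~key.duplicated()]
```

One caveat: a frozenset also collapses repeated values inside a list, so [1, 1, 2] and [1, 2] would count as the same combination. If that matters, something like tuple(sorted(x)) could be used as the key instead.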
ashley