I want to extract the values from two columns of a pandas DataFrame and put them in a list with no duplicate values.

I have tried the following:

import numpy as np

arr = df[['column1', 'column2']].values
thelist = []
# Visit every cell and keep only values that have not been seen yet
for ix, iy in np.ndindex(arr.shape):
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])

This works but it is taking too long. The dataframe contains around 30 million rows.

Example:

  column1 column2 
1   adr1   adr2   
2   adr1   adr2   
3   adr3   adr4   
4   adr4   adr5   

Should generate the list with the values:

[adr1, adr2, adr3, adr4, adr5]

Can you please help me find a more efficient way of doing this, given that the dataframe contains 30 million rows?

rpanai
alejo
  • `np.unique(df.values)`. The default is to flatten arrays, so this does exactly what you want. – ALollz Feb 21 '19 at 18:56
  • `list(np.unique(df.to_numpy()))` – kudeh Feb 21 '19 at 18:59
  • Possible duplicate of [pandas unique values multiple columns](https://stackoverflow.com/questions/26977076/pandas-unique-values-multiple-columns) – ALollz Feb 21 '19 at 19:00
  • @ALollz is it normal that the contiguous order is not preserved? I need it to be contiguous. – alejo Feb 21 '19 at 19:57
  • @alejo then try `pd.unique(df.values.ravel())`. `pd.unique` preserves order, while `np.unique` sorts – ALollz Feb 21 '19 at 20:08

2 Answers

@ALollz gave the right answer in the comments; I'll extend it. To get a list, as requested, just use list(np.unique(df.values)). Note that np.unique returns the values sorted.

meW

You can simply use np.unique(df), which is probably the shortest version.

Formally, the first parameter of np.unique should be an array_like object, but I checked that passing a DataFrame directly also works.

Of course, if you want a plain list rather than an ndarray, write np.unique(df).tolist().
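As a minimal sketch of the above, restricted to the two columns from the question: the sample frame below mirrors the question's example, and the extra `other` column is a hypothetical addition included only to show why selecting the two columns first may matter.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; "other" is a made-up extra column.
df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
    "other": [10, 20, 30, 40],
})

# Select only the two columns of interest, then let np.unique flatten
# the 2-D array and return the distinct values in sorted order.
cols = df[["column1", "column2"]].to_numpy()
unique_sorted = np.unique(cols).tolist()

print(unique_sorted)  # ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']
```

Because np.unique sorts, this suits cases where the order of appearance does not matter.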

Edit following your comment

If you want the list unique but in the order of appearance, write:

pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()

Operation order:

  • reshape changes the source array into a single column.
  • Then a DataFrame is created, with default column name = 0.
  • Then [0] takes just this (the only) column.
  • drop_duplicates does exactly what its name says.
  • And the last step: tolist converts to a plain list.
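The steps above can be sketched on a small frame as follows. The sample data is made up so that order of appearance differs from sorted order, and the `pd.unique` variant suggested in the comments is shown alongside for comparison; this is an illustration, not a benchmark.

```python
import pandas as pd

# Made-up data in which first-appearance order is not alphabetical.
df = pd.DataFrame({
    "column1": ["adr3", "adr1", "adr1", "adr4"],
    "column2": ["adr4", "adr2", "adr2", "adr5"],
})

# The chain from the answer: one long column, then drop repeated values.
via_drop = pd.DataFrame(df.values.reshape(-1, 1))[0].drop_duplicates().tolist()

# The variant from the comments: flatten row by row, then pd.unique,
# which keeps the first occurrence of each value without sorting.
flat = df[["column1", "column2"]].to_numpy().ravel()
in_order = pd.unique(flat).tolist()

print(via_drop)  # ['adr3', 'adr4', 'adr1', 'adr2', 'adr5']
print(in_order)  # ['adr3', 'adr4', 'adr1', 'adr2', 'adr5']
```

Both variants read the table row by row, so a value from column2 in the first row comes before a value from column1 in the second row.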
Valdi_Bo
  • Thanks @Valdi_Bo, is it normal that the contiguous order is not preserved? I need it to be contiguous. – alejo Feb 21 '19 at 19:56
  • Do you mean the *order of appearance* in the source table? The documentation states that if no *axis* has been given, then the input array is flattened (not sure about the order). Another step where the order can change is the *np.unique* function itself. It seems that the result has been sorted. – Valdi_Bo Feb 21 '19 at 20:04
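To make the ordering question in these comments concrete, here is a small sketch on made-up values. It shows the default row-by-row flattening, the sorted result of np.unique, and one possible way to recover first-appearance order from np.unique through its `return_index` option.

```python
import numpy as np
import pandas as pd

# Made-up values chosen so that appearance order differs from sorted order.
df = pd.DataFrame({"column1": ["b", "d", "a"], "column2": ["a", "c", "b"]})

# Default (C-order) flattening walks the table row by row.
flat = df.to_numpy().ravel()
print(flat.tolist())  # ['b', 'a', 'd', 'c', 'a', 'b']

# np.unique returns the distinct values sorted; return_index additionally
# gives the position of each value's first occurrence in the flat array.
values, first_idx = np.unique(flat, return_index=True)
sorted_unique = values.tolist()
print(sorted_unique)  # ['a', 'b', 'c', 'd']

# Sorting those positions restores the order of first appearance.
first_seen = flat[np.sort(first_idx)].tolist()
print(first_seen)  # ['b', 'a', 'd', 'c']
```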