Efficient way of converting a numpy array of 2 dimensions into a list with no duplicates

Question

I want to extract the values from two columns of a pandas dataframe and put them in a single list with no duplicate values.

I have tried the following:

arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    # membership test on a list rescans the whole list every time
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])

This works but it is taking too long. The dataframe contains around 30 million rows.

Example:

  column1 column2 
1   adr1   adr2   
2   adr1   adr2   
3   adr3   adr4   
4   adr4   adr5

Should generate the list with the values:

[adr1, adr2, adr3, adr4, adr5]

Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows?

`np.unique(df.values)`. The default is to flatten arrays, so this does exactly what you want. — ALollz, Feb 21 '19 at 18:56
Possible duplicate of [pandas unique values multiple columns](https://stackoverflow.com/questions/26977076/pandas-unique-values-multiple-columns) — ALollz, Feb 21 '19 at 19:00
@ALollz is it normal that the contiguous order is not preserved? I need it to be contiguous. — alejo, Feb 21 '19 at 19:57
@alejo then try `pd.unique(df.values.ravel())`. `pd.unique` preserves order, while `np.unique` sorts — ALollz, Feb 21 '19 at 20:08

score 2 · Answer 1 · answered Feb 21 '19 at 18:59

@ALollz gave the right answer; I'll extend it. To get a list, as requested, just use list(np.unique(df.values)).
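As a minimal sketch of that one-liner, here it is applied to the example frame from the question. The frame is rebuilt by hand, so the variable names other than the column names are just illustrative. Keep in mind that np.unique returns the values sorted, which is not necessarily the order in which they appear.

```python
import numpy as np
import pandas as pd

# Example frame from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
})

# np.unique flattens the 2-D array and returns the distinct values, sorted.
unique_sorted = list(np.unique(df[["column1", "column2"]].values))
print(unique_sorted)
```

This replaces the quadratic "not in thelist" scan with a single vectorised call, which should matter a lot at 30 million rows.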

answered Feb 21 '19 at 18:59 by meW

score 1 · Answer 2 · Valdi_Bo · 2019-02-21T20:21:54.680

You can use just np.unique(df) (maybe this is the shortest version).

Formally, the first parameter of np.unique should be an array_like object, but as I checked, you can also pass just a DataFrame.

Of course, if you want a plain list rather than an ndarray, write np.unique(df).tolist().
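A quick way to check the claim about passing the DataFrame directly, again on the question's example data (rebuilt by hand, so treat the setup as an assumption):

```python
import numpy as np
import pandas as pd

# Example frame from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
})

from_frame = np.unique(df)           # DataFrame is converted to an array first
from_values = np.unique(df.values)   # explicit conversion, same result

# Both calls return an ndarray; .tolist() turns it into a plain Python list.
as_list = from_frame.tolist()
print(type(from_frame).__name__, type(as_list).__name__)
```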

Edit following your comment

If you want the unique values in their order of appearance, write:

pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()

Operation order:

reshape changes the source array into a single column.
Then a DataFrame is created, with default column name = 0.
Then [0] takes just this (the only) column.
drop_duplicates does exactly what its name says, keeping the first occurrence of each value.
And the last step: tolist converts to a plain list.
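The steps above can be sketched one at a time as follows. The data here is hypothetical: it is the question's example with one value changed so that order of appearance differs from sorted order, and the intermediate variable names are mine. The pd.unique variant from the comments under the question is included for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the question's example, tweaked so that
# order of appearance is not the same as sorted order.
df = pd.DataFrame({
    "column1": ["adr3", "adr3", "adr1", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
})

flat = df.values.reshape(-1, 1)      # shape (8, 1), values taken row by row
as_frame = pd.DataFrame(flat)        # single column, default name 0
column = as_frame[0]                 # that column as a Series
deduped = column.drop_duplicates()   # keeps the first occurrence of each value
in_order = deduped.tolist()          # plain Python list

# Variant suggested in the comments: pd.unique also keeps appearance order.
via_pd_unique = pd.unique(df.values.ravel()).tolist()

# np.unique, by contrast, returns the values sorted.
sorted_values = np.unique(df.values).tolist()

print(in_order)
print(sorted_values)
```

On this data both order-preserving variants should agree with each other and differ from the np.unique result only in ordering.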

answered Feb 21 '19 at 19:40 by Valdi_Bo, edited Feb 21 '19 at 20:21

Thanks @Valdi_Bo, is it normal that the contiguous order is not preserved? I need it to be contiguous. – alejo Feb 21 '19 at 19:56
Do you mean the *order of appearance* in the source table? The documentation states that if no *axis* is given, the input array is flattened (I'm not sure in which order). The other step where the order can change is the *np.unique* function itself: it seems that the result is sorted. – Valdi_Bo Feb 21 '19 at 20:04
