16

The question was originally asked here as a comment but could not get a proper answer as the question was marked as a duplicate.

For a given pandas.DataFrame, let us say

df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
df

     A   B
0    5   1
1    6   2
2    3   3
3    4   5

How can we select rows from a list, based on values in a column ('A' for instance)

For instance

# from
list_of_values = [3,4,6]

# we would like, as a result
#      A   B
# 2    3   3
# 3    4   5
# 1    6   2

Using isin as mentioned here is not satisfactory as it does not keep order from the input list of 'A' values.

How can the abovementioned goal be achieved?

syltruong
  • 2,563
  • 20
  • 33

3 Answers3

13

One way to overcome this is to make the 'A' column an index and use loc on the newly generated pandas.DataFrame. Eventually, the subsampled dataframe's index can be reset.

Here is how:

ret = df.set_index('A').loc[list_of_values].reset_index(inplace=False)

# ret is
#      A   B
# 0    3   3
# 1    4   5
# 2    6   2 

Note that the drawback of this method is that the original indexing has been lost in the process.

More on pandas indexing: What is the point of indexing in pandas?

syltruong
  • 2,563
  • 20
  • 33
  • One question - all values of `list_of_values` are in column? Is possible `list_of_values = [3,4,6,7,7,4]` ? – jezrael Aug 24 '18 at 05:32
  • in practice no but its true that this solution has the inconvenience of not handling out of column values – syltruong Aug 24 '18 at 06:07
  • So the best is the most general solution with not remove original indices, working with duplicates values? – jezrael Aug 24 '18 at 06:10
5

Use merge with helper DataFrame created by list and with column name of matched column:

df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3,5]})

list_of_values = [3,6,4]
df1 = pd.DataFrame({'A':list_of_values}).merge(df)
print (df1)
   A  B
0  3  3
1  6  2
2  4  5

For more general solution:

df = pd.DataFrame({'A' : [5,6,5,3,4,4,6,5], 'B':range(8)})
print (df)
   A  B
0  5  0
1  6  1
2  5  2
3  3  3
4  4  4
5  4  5
6  6  6
7  5  7

list_of_values = [6,4,3,7,7,4]

#create df from list 
list_df = pd.DataFrame({'A':list_of_values})
print (list_df)
   A
0  6
1  4
2  3
3  7
4  7
5  4

#column for original index values
df1 = df.reset_index()
#helper column for count duplicates values
df1['g'] = df1.groupby('A').cumcount()
list_df['g'] = list_df.groupby('A').cumcount()

#merge together, create index from column and remove g column
df = list_df.merge(df1).set_index('index').rename_axis(None).drop('g', axis=1)
print (df)
   A  B
1  6  1
4  4  4
3  3  3
5  4  5
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 2
    Original index is lost in the process. – Zero Aug 21 '18 at 08:06
  • @Zero - Then is necessary `df1 = pd.DataFrame({'A':list_of_values}).merge(df.reset_index()).set_index('index').rename_axis(None)` – jezrael Aug 21 '18 at 08:08
  • Actually, I found out this approach does not work in the case `list_of_values` contains repeated values: the order is not guaranteed to be kept. Im sorry I had to unvote as answer – syltruong Aug 24 '18 at 03:37
  • 1
    @syltruong - I try create the more general solution for duplicates values `(4)` and also not matched values `(7)`. – jezrael Aug 24 '18 at 06:31
1

1] Generic approach for list_of_values.

In [936]: dff = df[df.A.isin(list_of_values)]

In [937]: dff.reindex(dff.A.map({x: i for i, x in enumerate(list_of_values)}).sort_values().index)
Out[937]:
   A  B
2  3  3
3  4  5
1  6  2

2] If list_of_values is sorted. You can use

In [926]: df[df.A.isin(list_of_values)].sort_values(by='A')
Out[926]:
   A  B
2  3  3
3  4  5
1  6  2
Zero
  • 74,117
  • 18
  • 147
  • 154