0

I have a data set like this:

query  hit  score1  score2
group1  hit1  0.9  0.8
group1  hit2  0.2  0.9
group1  hit3  0.2  0.9
group2  hit1  0.1  0.3
group2  hit2  0.6  0.8
group2  hit3  0.3  0.3
group2  hit4  0.7  0.5

I want the output to return the lowest n (e.g. 2) rows, per group, based on a columnn (e.g. score 1), so this would turn into:

query  hit  score1  score2
group1  hit2  0.2  0.9
group1  hit3  0.2  0.9
group2  hit1  0.1  0.3
group2  hit3  0.3  0.3

I wrote:

df = pd.read_csv('file_name.txt',sep='\t',header=0)
df = df.groupby('query')['score1'].nsmallest(2).join(df,['score1'])
print(df)

based on seeing this answer.

But I get the error series has no attribute join.

I did this because when I run:

df = pd.read_csv('file_name.txt',sep='\t',header=0)
df = df.groupby('query')['score1'].nsmallest(2)
print(df)

but it only returns the column that the grouping is done by, I wasn't sure how to extend to return the whole row, hence why I tried based on a SO answer that gave me the error.

Slowat_Kela
  • 1,377
  • 2
  • 22
  • 60
  • If your data is ordered, you could try an answer from here https://stackoverflow.com/questions/20069009/pandas-get-topmost-n-records-within-each-group. In your case, the line in question would be `df.groupby('query').head(2)`. The answer @not_speshal is probably better as it explicitly orders the data. – Harpe Mar 24 '22 at 14:28

2 Answers2

4

Try using rank:

>>> df[df.groupby("query")["score1"].transform("rank").le(2)]

    query   hit  score1  score2
1  group1  hit2     0.2     0.9
2  group1  hit3     0.2     0.9
3  group2  hit1     0.1     0.3
5  group2  hit3     0.3     0.3

Alternatively with nsmallest and loc:

>>> df.loc[df.groupby('query')['score1'].nsmallest(2).reset_index(level=0, drop=True).index]
    query   hit  score1  score2
1  group1  hit2     0.2     0.9
2  group1  hit3     0.2     0.9
3  group2  hit1     0.1     0.3
5  group2  hit3     0.3     0.3
not_speshal
  • 22,093
  • 2
  • 15
  • 30
3

Sort first, and then get the first N rows from each group.

>>> df.sort_values(["query", "score1"]).groupby("query").head(2) 
    query   hit  score1  score2
1  group1  hit2     0.2     0.9
2  group1  hit3     0.2     0.9
3  group2  hit1     0.1     0.3
5  group2  hit3     0.3     0.3

Alternatively, you can use nsmallest in a groupby.apply:

>>> df.groupby("query").apply(lambda s: s.nsmallest(2, "score1"))
           query   hit  score1  score2
query                                 
group1 1  group1  hit2     0.2     0.9
       2  group1  hit3     0.2     0.9
group2 3  group2  hit1     0.1     0.3
       5  group2  hit3     0.3     0.3
Cameron Riddell
  • 10,942
  • 9
  • 19