I want to retrieve every distinct name that appears at least once in the Pfam_domains column.

Here is my dataframe:

           TCID       Fonction        Genbank Uniprot Pfam_domains
0     3.A.1.1.1           MalE           MalE  P0AEX9      PF00528
1     3.A.1.1.1           MalF           MalF  P02916      PF01547
2     3.A.1.1.1           MalG           MalG  P68183      PF00528
3     3.A.1.1.1           MalK           MalK  P68187      PF00005
4     3.A.1.1.1           MalK           MalK  P68187      PF17912
..          ...            ...            ...     ...          ...
178  3.A.1.5.32  LAC30SC_07295  LAC30SC_07295  F0TFS7      PF00528
179  3.A.1.5.32  LAC30SC_07300  LAC30SC_07300  F0TFS8      PF00528
180  3.A.1.5.32  LAC30SC_07305  LAC30SC_07305  F0TFS9      PF00005
181  3.A.1.5.32  LAC30SC_07305  LAC30SC_07305  F0TFS9      PF08352
182  3.A.1.5.32  LAC30SC_07310  LAC30SC_07310  F0TFT0      PF00005

This is my code:

for i in range(1, len(df)-1):
    unite=pd.unique(df['Pfam_domains'][i])

The problem is that this lists every occurrence of every domain, instead of listing each domain only once.

Here is what I would like to have in output:

"PF00528"
"PF01547"
"PF00005"
...
lmj

2 Answers

I believe this is what you're looking for.

unite = df['Pfam_domains'].unique()
unite.sort()
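
For anyone who wants to try this without the full dataset, here is a minimal sketch. The small dataframe is only a stand-in built from the first five rows shown in the question, and the names first_seen and sorted_domains are made up for illustration.

```python
import pandas as pd

# Stand-in for the question's dataframe: only the first five rows it shows.
df = pd.DataFrame({
    "Uniprot": ["P0AEX9", "P02916", "P68183", "P68187", "P68187"],
    "Pfam_domains": ["PF00528", "PF01547", "PF00528", "PF00005", "PF17912"],
})

# unique() keeps each value once, in order of first appearance.
unite = df["Pfam_domains"].unique()
first_seen = list(unite)

# sorted() returns a new sorted list and leaves `unite` untouched.
sorted_domains = sorted(unite)

for name in first_seen:
    print(name)
```

No loop over the rows is needed: unique() works on the whole column at once.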
rhug123

Start by sorting the value counts in descending order:

df.Pfam_domains.value_counts().sort_values(ascending=False)

This already satisfies your request for values that are "mentioned at least once": any value present in the column is, by definition, mentioned at least once, and the result lists each one exactly once along with its count. If you are actually looking for values that appear more than once, this is also a good starting point.
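
As a rough sketch of both readings, assuming a small stand-in dataframe built from the first five rows in the question (the names counts, all_domains and repeated are only illustrative):

```python
import pandas as pd

# Stand-in data: the Pfam_domains values from the first five rows of the question.
df = pd.DataFrame({
    "Pfam_domains": ["PF00528", "PF01547", "PF00528", "PF00005", "PF17912"],
})

# value_counts() returns one entry per distinct value, most frequent first.
counts = df["Pfam_domains"].value_counts()

# "At least once": every distinct domain is in the index.
all_domains = counts.index.tolist()

# "More than once": keep only the entries whose count exceeds 1.
repeated = counts[counts > 1].index.tolist()

print(counts)
print(repeated)
```

Note that value_counts() sorts by count in descending order by default, so the extra sort_values call mainly makes that intent explicit.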

aggis