
I need to build a frequency dictionary from a pandas Series (the 'amino_acid' column of the dataframe below) in which each entry accumulates the value from the adjacent 'templates' column.

    templates   amino_acid
0   118       CAWSVGQYSNQPQHF
1   635       CASSLRGNQPQHF
2   468       CASSHGTAYEQYF
3   239       CASSLDRLSSGEQYF
4   51        CSVEDGPRGTQYF

My current approach of iterating through the dataframe seems to be inefficient and even an anti-pattern according to this post. How can I improve the efficiency/use best practice for doing this?

My current approach:

sequence_counts = {}
seqs = list(zip(df.amino_acid, df.templates))

for seq in seqs:
    if seq[0] not in sequence_counts:
        sequence_counts[seq[0]] = 0
    sequence_counts[seq[0]] += seq[1]

I've seen people do it the way below, but I can't figure out how to adjust it to add each respective 'templates' entry:

sequence_counts = df['amino_acid'].value_counts().to_dict()

Any help/feedback would be greatly appreciated! :)

Luke
  • It sounds like you want to groupby-aggregate, is this what you're looking for? `df.groupby('amino_acid').templates.sum()`. If not, could you clarify what some sample output would look like? – Nolan Conaway Jun 24 '19 at 01:13
  • Yes thank you @NolanConaway! That's exactly what I was looking for. I needed to get the frequency of each amino_acid string plus the sum of the templates entry for each occurrence. – Luke Jun 25 '19 at 16:36

2 Answers


I just tested the code from @Nolan Conaway's comment, and it is the best thing to do:

df.groupby('amino_acid').templates.sum()

With this, you get a Series (indexed by amino_acid) containing what you need; call .to_dict() on it if you want a dictionary. Since it uses only native pandas functions, it runs faster and is also more concise and cleaner.
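As a minimal sketch of the groupby approach, here it is applied to the question's sample data. One repeated sequence with a made-up templates value of 10 has been added (an assumption, purely for illustration) so that the summing is visible:

```python
import pandas as pd

# Sample data from the question, plus one duplicated sequence (an assumption
# added for illustration) so that the sum over repeats is visible.
df = pd.DataFrame({
    "templates": [118, 635, 468, 239, 51, 10],
    "amino_acid": [
        "CAWSVGQYSNQPQHF", "CASSLRGNQPQHF", "CASSHGTAYEQYF",
        "CASSLDRLSSGEQYF", "CSVEDGPRGTQYF", "CASSLRGNQPQHF",
    ],
})

# Sum 'templates' per distinct 'amino_acid'; the result is a Series
# indexed by amino_acid.
template_sums = df.groupby("amino_acid")["templates"].sum()

# Convert to a plain dict, like the one the loop in the question builds.
sequence_counts = template_sums.to_dict()

# If the number of occurrences is wanted as well, passing a list of
# aggregations returns one column per aggregation ('count' and 'sum').
summary = df.groupby("amino_acid")["templates"].agg(["count", "sum"])
```

Here sequence_counts maps the repeated sequence to 645 (635 + 10), and summary holds both the occurrence count and the total for each sequence.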

As for speed, I measured the elapsed time on a dataframe with 10^4 rows, and this code is roughly 600 times faster (0.007 vs 4.3 seconds) than my answer below.
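A rough way to reproduce this kind of comparison is sketched below. The random test data and the helper names with_groupby and with_iterrows are made up for illustration, and the timings will differ from machine to machine:

```python
import random
import timeit

import pandas as pd

random.seed(0)
n_rows = 10_000

# Hypothetical test data: short random strings so that many keys repeat.
df = pd.DataFrame({
    "templates": [random.randint(1, 1000) for _ in range(n_rows)],
    "amino_acid": ["".join(random.choices("ACDEFG", k=4)) for _ in range(n_rows)],
})


def with_groupby(frame):
    """Sum templates per amino_acid using pandas' groupby."""
    return frame.groupby("amino_acid")["templates"].sum().to_dict()


def with_iterrows(frame):
    """Sum templates per amino_acid with an explicit row loop."""
    counts = {}
    for _, row in frame.iterrows():
        aa = row["amino_acid"]
        counts[aa] = counts.get(aa, 0) + row["templates"]
    return counts


# Both approaches should produce the same totals.
assert with_groupby(df) == with_iterrows(df)

# Average wall-clock time over a few runs of each approach.
t_groupby = timeit.timeit(lambda: with_groupby(df), number=3) / 3
t_iterrows = timeit.timeit(lambda: with_iterrows(df), number=3) / 3
print(f"groupby:  {t_groupby:.4f} s per run")
print(f"iterrows: {t_iterrows:.4f} s per run")
```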

Nolan should post the comment as an answer so he can be credited for his neat and clever use of the pandas dataframe API.

I will leave my answer here just in case someone finds the comments useful.

I don't know the pandas API completely, and I couldn't find any combination of calls that would get you what you needed (but Nolan did!). Still, it seems you can improve your code by not creating a list or explicitly zipping the data. Using iterators instead of those structures can improve performance.

For example, in list(zip(df.amino_acid, df.templates)), the list call is not really necessary: in Python 2, zip already returns a list, and in Python 3 it returns a lazy iterator that the for loop can consume directly. In Python 2 you could also use the izip function from the itertools library, which gives an iterator without building a list. Also, pandas provides its own row iterators, which you can use instead of zipping the columns (note that accessing a column returns a Series, not a copy of the data in a list).

Anyway, I would try something like this.

sequence_counts = { }
for _, row in df.iterrows():
    t, aa = row['templates'], row['amino_acid']
    s = sequence_counts.get(aa, 0)
    sequence_counts[aa] = s + t

In this way you are really iterating through the data just once, with the iterator the dataframe gives you.
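If an explicit loop is preferred, a variation on the same idea is to let collections.defaultdict handle missing keys and to iterate over the two columns with zip, which is lazy in Python 3. This is only a sketch on the question's sample data, with one repeated sequence added for illustration (an assumption); it is not a replacement for the groupby version above:

```python
from collections import defaultdict

import pandas as pd

# Sample data from the question, plus one duplicated sequence (an assumption
# added for illustration).
df = pd.DataFrame({
    "templates": [118, 635, 468, 239, 51, 10],
    "amino_acid": [
        "CAWSVGQYSNQPQHF", "CASSLRGNQPQHF", "CASSHGTAYEQYF",
        "CASSLDRLSSGEQYF", "CSVEDGPRGTQYF", "CASSLRGNQPQHF",
    ],
})

# defaultdict(int) starts every unseen key at 0, so no membership test
# or .get() call is needed inside the loop.
totals = defaultdict(int)
for aa, t in zip(df["amino_acid"], df["templates"]):
    totals[aa] += t

# Convert back to a regular dict once the accumulation is done.
sequence_counts = dict(totals)
```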

eguaio

My understanding from your question is that you wish to create a dictionary key/value such that key=amino_acid and value is the frequency = templates

Since you have successfully created the tuples with seqs = list(zip(df.amino_acid, df.templates))

your dictionary can be constructed as:

sequence_counts = dict(seqs)

in one line:

sequence_counts = dict(zip(df.amino_acid, df.templates))

or you can do something of this nature:

sequence_counts = dict([(k,v) for k,v in zip(df.amino_acid,df.templates)])
adhg
  • The code in the question shows that he needs the sum of templates for the same amino_acid. Besides, building the list inside the dict constructor is not really necessary and results in worse performance. – eguaio Jun 24 '19 at 02:52
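
To illustrate the point of the comment above: when an amino_acid value occurs more than once, dict(zip(...)) keeps only the last templates value for that key instead of the sum. A small sketch with made-up duplicate data (the second row is an assumption, added for illustration):

```python
import pandas as pd

# Two rows with the same sequence; the second templates value is made up.
df = pd.DataFrame({
    "templates": [635, 10],
    "amino_acid": ["CASSLRGNQPQHF", "CASSLRGNQPQHF"],
})

# dict() keeps the last value seen for a repeated key, so 635 is overwritten.
last_value_only = dict(zip(df["amino_acid"], df["templates"]))

# groupby + sum adds the values of the repeated key instead.
summed = df.groupby("amino_acid")["templates"].sum().to_dict()

print(last_value_only)
print(summed)
```

Here last_value_only holds 10 for the repeated sequence, while summed holds 645.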