How can this R code be sped up with the apply (lapply, mapply etc.) functions?

Question

I am not too proficient with the apply functions, or with R, but I know I overuse for loops, which makes my code slow. How can the following code be sped up with apply functions, or in any other way?

sum_store = NULL
for (col in 1:ncol(cazy_fams)){ # for each column in cazy_fams (so for each master family, e.g. GH, AA etc.)
  for (row in 1:nrow(cazy_fams)){ # for each row in cazy_fams (so the specific family number, e.g. GH1, AA7 etc.)
    # Isolating the row that pertains to the current cazy family being looked at for every dataframe in the list
    filt_fam = lapply(family_summary, function(sample){
      sample[as.character(sample$Family) %in% paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = ""),]
    })
    row_cat = do.call(rbind, filt_fam) # concatenating the lapply list output into a dataframe
    if (nrow(row_cat) > 0){
      fam_sum = aggregate(proteins ~ Family, data=row_cat, FUN=sum) # collapsing the dataframe into one row and summing the proteins count
      sum_store = rbind(sum_store, fam_sum) # storing the results for that family
    } else if (grepl("NA", paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")) == FALSE) {
      Family = paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")
      proteins = 0
      sum_store = rbind(sum_store, data.frame(Family, proteins))
    } else {
      next
    }
  }
}

family_summary is just a list of 18 two-column dataframes that look like this:

Family proteins
CE0        2
CE1        9
CE4       15
CE7        1
CE9        1
CE14       10
GH0        5
GH1        1
GH3        4
GH4        1
GH8        1
GH9        2
GH13        2
GH15        5
GH17        1

with different cazy families.

cazy_fams is just a dataframe with each column being a cazy class (e.g. GH, AA etc.) and each row being a family number, all taken from the linked website:

GH GT PL CE AA CBM
1  1  1  1  1   1
2  2  2  2  2   2
3  3  3  3  3   3
4  4  4  4  4   4
5  5  5  5  5   5
6  6  6  6  6   6
7  7  7  7  7   7
8  8  8  8  8   8
9  9  9  9  9   9
10 10 10 10 10  10
11 11 11 11 11  11
12 12 12 12 12  12
13 13 13 13 13  13
14 14 14 14 14  14
15 15 15 15 15  15

The reason for the else if (grepl("NA", paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")) == FALSE) statement is that not all classes have the same number of families, so when looping over my dataframe I end up with names such as GHNA and AANA, with NA on the end.
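One way to avoid the NA suffix altogether is to drop the missing family numbers before pasting, rather than pattern-matching on the pasted string afterwards. The following is only a minimal sketch: the small `cazy_fams` below is a hypothetical stand-in for the real table, and `all_fams` is a name introduced here for illustration.

```r
# Hypothetical stand-in for cazy_fams: the AA class has fewer families,
# so its column is padded with NA.
cazy_fams <- data.frame(GH = c(1, 2, 3), AA = c(1, 2, NA))

# For each class (column), paste the class name onto its non-NA family numbers.
all_fams <- unlist(lapply(names(cazy_fams), function(cls) {
  nums <- cazy_fams[[cls]]
  paste0(cls, nums[!is.na(nums)])
}))

all_fams
```

Testing with `is.na()` on the numbers should also be a little more robust than `grepl("NA", ...)`, which would match any class name that happened to contain the letters "NA".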

The output sum_store is this:

Family proteins
GH1       54
GH2       51
GH3      125
GH4       29
GH5       40
GH6       25
GH7        0
GH8       16
GH9       25
GH10       19
GH11        5
GH12        5
GH13      164
GH14        3
GH15       61

A dataframe with all listed cazy families and the total number of appearances across the family_summary list. Please let me know if you need anything else to help answer my question.
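For illustration, the same result can probably be reached without the nested loops by stacking the list once, aggregating once, and then looking up each known family. This is a sketch under assumptions, not a tested drop-in replacement: the inputs below are small made-up stand-ins, and `all_fams` is a hypothetical character vector holding every family name from `cazy_fams` with the NA padding already removed.

```r
# Hypothetical stand-ins for the real inputs.
family_summary <- list(
  data.frame(Family = c("GH1", "GH3", "CE1"), proteins = c(1, 4, 9)),
  data.frame(Family = c("GH1", "AA2"), proteins = c(2, 5))
)
# Assumed to be built from cazy_fams beforehand (class name + family number).
all_fams <- c("GH1", "GH2", "GH3", "AA1", "AA2")

# Stack all dataframes once instead of re-filtering the list for every family.
stacked <- do.call(rbind, family_summary)

# Sum the protein counts per family in a single pass.
totals <- aggregate(proteins ~ Family, data = stacked, FUN = sum)

# Look up each known family; families that never occur get a count of 0.
idx <- match(all_fams, as.character(totals$Family))
sum_store <- data.frame(
  Family = all_fams,
  proteins = ifelse(is.na(idx), 0, totals$proteins[idx])
)
```

The likely gain comes less from the apply functions themselves than from calling `rbind` and `aggregate` once rather than once per family, and from not growing `sum_store` inside a loop. As in the original loop, families that appear in `family_summary` but not in the known list (here `CE1`) are dropped.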

Please show a fuller sample of the data (multiple rows) and the desired results, which illustrate the problem better than dense code and words. See [How to make a great R reproducible example](https://stackoverflow.com/q/5963269/1422451). — Parfait, Jan 31 '20 at 16:39
Here is a great post for your problem: https://stackoverflow.com/questions/38649411/r-speed-up-the-for-loop-using-apply-or-lapply-or-etc — Florian, Feb 03 '20 at 08:50
Can you align your expected output with the small sample input? And please explain *total number of appearances*. For instance, what does 54 represent for GH1 in the first row? And how is *cazy_fams* used? — Parfait, Feb 03 '20 at 18:06
Not really, as the sample input is a small representation from a list of 18 dataframes. Total number of appearances means the sum of each cazy family's proteins count across every dataframe in the `family_summary` list. `cazy_fams` is just used as an input dataframe listing all cazy families that exist in the cazy database. — Lamma, Feb 04 '20 at 08:04
