
I am not too proficient with the apply functions, or with R in general, but I know I overuse for loops, which makes my code slow. How can the following code be sped up with apply functions, or in any other way?

sum_store = NULL
for (col in 1:ncol(cazy_fams)){ # for each column in cazy_fams (i.e. for each master family, e.g. GH, AA, etc.)
  for (row in 1:nrow(cazy_fams)){ # for each row in cazy_fams (i.e. the specific family number, e.g. GH1, AA7, etc.)
    # Isolating the row that pertains to the current cazy family being looked at for every dataframe in the list
    filt_fam = lapply(family_summary, function(sample){
      sample[as.character(sample$Family) %in% paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = ""),]
    })
    row_cat = do.call(rbind, filt_fam) # concatenating the lapply list output into a data frame
    if (nrow(row_cat) > 0){
      fam_sum = aggregate(proteins ~ Family, data=row_cat, FUN=sum) # collapsing the dataframe into one row and summing the proteins count
      sum_store = rbind(sum_store, fam_sum) # storing the results for that family
    } else if (grepl("NA", paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")) == FALSE) {
      Family = paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")
      proteins = 0
      sum_store = rbind(sum_store, data.frame(Family, proteins))
    } else {
      next
    }
  }
}

family_summary is just a list of 18 two-column data frames that look like this:

Family proteins
CE0        2
CE1        9
CE4       15
CE7        1
CE9        1
CE14       10
GH0        5
GH1        1
GH3        4
GH4        1
GH8        1
GH9        2
GH13        2
GH15        5
GH17        1

with different cazy families.

cazy_fams is just a data frame in which each column is a cazy class (e.g. GH, AA, etc.) and each row is a family number, all taken from the linked website:

GH GT PL CE AA CBM
1  1  1  1  1   1
2  2  2  2  2   2
3  3  3  3  3   3
4  4  4  4  4   4
5  5  5  5  5   5
6  6  6  6  6   6
7  7  7  7  7   7
8  8  8  8  8   8
9  9  9  9  9   9
10 10 10 10 10  10
11 11 11 11 11  11
12 12 12 12 12  12
13 13 13 13 13  13
14 14 14 14 14  14
15 15 15 15 15  15

The reason for the else if (grepl("NA", paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")) == FALSE) statement is that not all classes have the same number of families, so when looping over my data frame I end up with some names such as GHNA and AANA, with NA on the end.
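The NA padding described here could also be dealt with before pasting, by testing the cells with is.na() instead of searching the pasted string for "NA". A minimal sketch, assuming cazy_fams is padded with NA wherever a class has fewer families; the toy data frame and the names class_names, numbers, keep and all_families are made up for illustration:

```r
# Toy stand-in for cazy_fams (hypothetical values): AA has fewer families
# than GH, so its column is padded with NA.
cazy_fams <- data.frame(GH = c(1, 2, 3), AA = c(1, 2, NA))

# Repeat each class name once per row, in the same column-by-column order
# that unlist() uses for the cell values.
class_names <- rep(colnames(cazy_fams), each = nrow(cazy_fams))
numbers <- unlist(cazy_fams, use.names = FALSE)

# Keep only the real (non-NA) cells, then paste class and number together.
keep <- !is.na(numbers)
all_families <- paste0(class_names[keep], numbers[keep])
all_families
```

This builds every family name in one vectorised step, so names like GHNA never get created in the first place.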

The output sum_store is this:

Family proteins
GH1       54
GH2       51
GH3      125
GH4       29
GH5       40
GH6       25
GH7        0
GH8       16
GH9       25
GH10       19
GH11        5
GH12        5
GH13      164
GH14        3
GH15       61

That is, a data frame listing every cazy family together with its total protein count summed across the family_summary list. Please let me know if you need anything else to help answer my question.
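One possible way to get this kind of result without the nested loops is to bind the list once, sum per family, and then look the totals up against the full vector of family names. This is only a sketch under assumptions: the toy family_summary and all_families below are invented for illustration, and families that appear in no sample are given 0.

```r
# Toy stand-ins (hypothetical data): a list of two-column data frames.
family_summary <- list(
  data.frame(Family = c("GH1", "GH3"), proteins = c(1, 4)),
  data.frame(Family = c("GH1", "CE4"), proteins = c(2, 15))
)
# Every family name to report, e.g. built from cazy_fams.
all_families <- c("GH1", "GH2", "GH3", "CE4")

# Bind the whole list once, instead of once per family.
combined <- do.call(rbind, family_summary)

# Sum the protein counts per family across all samples.
totals <- aggregate(proteins ~ Family, data = combined, FUN = sum)

# Look up each listed family; families absent from every sample give NA.
proteins <- totals$proteins[match(all_families, totals$Family)]
proteins[is.na(proteins)] <- 0

sum_store <- data.frame(Family = all_families, proteins = proteins)
sum_store
```

Because rbind() and aggregate() each run once rather than once per family, the amount of work no longer grows with the number of cells in cazy_fams times the length of the list.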

Lamma
  • Please show us a fuller sample of the data (multiple rows) and the desired results, which illustrate the problem better than dense code and words. See [How to make a great R reproducible example](https://stackoverflow.com/q/5963269/1422451). – Parfait Jan 31 '20 at 16:39
  • I have added the requested information. – Lamma Feb 03 '20 at 08:47
  • Here is a great post for your problem: https://stackoverflow.com/questions/38649411/r-speed-up-the-for-loop-using-apply-or-lapply-or-etc – Florian Feb 03 '20 at 08:50
  • Can you align your expected output with the small sample input? And please explain *total number of appearances*. For instance, what does 54 represent for GH1 in the first row? And how is *cazy_fams* used? – Parfait Feb 03 '20 at 18:06
  • Not really, as the sample input is a small excerpt from a list of 18 data frames. Total number of appearances means the sum of each cazy family's proteins count across every data frame in the `family_summary` list. `cazy_fams` is just used as an input data frame listing all cazy families that exist in the cazy database. – Lamma Feb 04 '20 at 08:04

0 Answers