I can't figure this problem out on my own, so I'm asking all of you for a little help.

This is part of my data:

rfam[1:20,]
     id              name
1  RF00001  LL_skoljka_r41782307_x1
2  RF00001   LL_skoljka_r9950955_x1
3  RF00001  LL_skoljka_r49323482_x1
4  RF00001  LL_skoljka_r14141437_x1
5  RF00001  LL_skoljka_r16457227_x3
6  RF00002  LL_skoljka_r40347558_x1
7  RF00002  LL_skoljka_r44415149_x1
8  RF00002  LL_skoljka_r13145032_x1
9  RF00002 LL_skoljka_r29248915_x42
10 RF00003  LL_skoljka_r15936986_x1
11 RF00003  LL_skoljka_r28953530_x1
12 RF00003  LL_skoljka_r32665758_x1
13 RF00003  LL_skoljka_r32835489_x1
14 RF00003  LL_skoljka_r32835498_x1
15 RF04051  LL_skoljka_r33254611_x1
16 RF04051 LL_skoljka_r29761867_x12
17 RF04051  LL_skoljka_r45123665_x2
18 RF04051 LL_skoljka_r34837827_x15
19 RF08595  LL_skoljka_r38900754_x1
20 RF08595  LL_skoljka_r22016530_x1

In the first step I want to remove everything up to and including the x in the variable name, so I use:

rfam$name<- as.data.frame(sapply(rfam$name, gsub, pattern='^.*?x', replacement=""))

Result:

rfam[1:20,]
     id       name
1  RF00001       1
2  RF00001       1
3  RF00001       1
4  RF00001       1
5  RF00001       3
6  RF00002       1
7  RF00002       1
8  RF00002       1
9  RF00002      42
10 RF00003       1
11 RF00003       1
12 RF00003       1
13 RF00003       1
14 RF00003       1
15 RF04051       1
16 RF04051      12
17 RF04051       2
18 RF04051      15
19 RF08595       1
20 RF08595       1

In the second step I would like to sum the values that remain in the variable name for each id.

Results should look like this:

view(rfam)
     id       name
1  RF00001       7
2  RF00002      45
3  RF00003       5
4  RF04051      30 
5  RF08595       2

To sum the values, the variable has to be numeric, but both of my variables are factors. So I converted id to character using rfam[,1]=as.character(rfam[,1]) and tried to convert name to numeric with rfam[,2]=as.numeric(levels(rfam[,2])[rfam[,2]]). The conversion of id was successful, while name came back as all NAs.

I've also tried rfam[,2]=as.numeric(as.character(rfam[,2])), but the result was the same.

I've also tried exporting the data to a txt file so I could do the rest of the analysis in Excel, but the exported data looks like this:

      "id"     "name"
"1" "RF00001"   c(1, 1, 1, 1, 9, 1, 1, 1, 11, 1, 1, 1, 1, 1, 1, 3, 7, 5, 1, 1, 1, 9, 1, 14, 10, 7, 1, 5, 1, 1, 1, 1, 1, 7, 1, 2, 1, 1, 1, 9, 1, 7, 1, 1, 1, 1, 1, 1, 10, 7, 1, 10, 7, 1, 1, 1, 1, 1, 7, 1, 10, 1, 1, 1, 1, 1, 1, 1, 7, 1,...) 
"2" "RF00001"   c(1, 1, 1, 1, 9, 1, 1, 1, 11, 1, 1, 1, 1, 1, 1, 3, 7, 5, 1, 1, 1, 9, 1, 14, 10, 7, 1, 5, 1, 1, 1, 1, 1, 7, 1, 2, 1, 1, 1, 9, 1, 7, 1, 1, 1, 1, 1, 1, 10, 7, 1, 10, 7, 1, 1, 1, 1, 1, 7, 1, 10, 1, 1, 1, 1, 1, 1, 1, 7, 1,...)    
"3" "RF00001"   c(1, 1, 1, 1, 9, 1, 1, 1, 11, 1, 1, 1, 1, 1, 1, 3, 7, 5, 1, 1, 1, 9, 1, 14, 10, 7, 1, 5, 1, 1, 1, 1, 1, 7, 1, 2, 1, 1, 1, 9, 1, 7, 1, 1, 1, 1, 1, 1, 10, 7, 1, 10, 7, 1, 1, 1, 1, 1, 7, 1, 10, 1, 1, 1, 1, 1, 1, 1, 7, 1,...)

This is where I'm stuck. I don't understand what is happening, and I would appreciate it if you could help me out.

sursek

1 Answer

Update

I now realize your question is not about the grouping part. The problem is that your as.data.frame(sapply(...)) call stores a data.frame inside rfam instead of a vector.
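To make the diagnosis concrete, here is a minimal sketch that reproduces the nested data.frame on a three-row toy copy of the data (the rows are illustrative, not the real file) and contrasts it with calling gsub() directly on the column. The object names `broken` and `fixed` are made up for this example.

```r
# Toy data modelled on the question; stringsAsFactors = TRUE mimics factor columns
rfam <- data.frame(
  id   = c("RF00001", "RF00001", "RF00002"),
  name = c("LL_skoljka_r41782307_x1", "LL_skoljka_r16457227_x3",
           "LL_skoljka_r29248915_x42"),
  stringsAsFactors = TRUE
)

# The call from the question: as.data.frame() wraps the sapply() result,
# so the "name" column becomes a data.frame rather than a vector
broken <- rfam
broken$name <- as.data.frame(sapply(broken$name, gsub,
                                    pattern = '^.*?x', replacement = ""))
is.data.frame(broken$name)

# gsub() is already vectorised, so it can be applied to the whole column
fixed <- rfam
fixed$name <- as.numeric(gsub('^.*?x', "", fixed$name))
is.numeric(fixed$name)
```

Because gsub() works on whole vectors, neither sapply() nor as.data.frame() is needed for this step.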

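You can use the following data.table solution to convert the rfam$name column to a numeric vector, so that it can be grouped:

library(data.table)
# gsub() strips everything up to the first "x"; := replaces the column by reference
setDT(rfam)[, name := as.numeric(gsub('^.*?x', replacement = "", name))]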

Now we can use dplyr to attain the desired output:

library(dplyr)
as.data.frame(rfam) %>% group_by(id) %>% summarise(name=sum(name))
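If loading extra packages is not an option, the same grouping can probably be done in base R with aggregate(). The sketch below is an alternative to the dplyr call, not part of the original solution; it assumes name is already numeric and rebuilds the 20 cleaned rows from the question by hand. The object name `totals` is made up for this example.

```r
# The 20 cleaned rows shown in the question, typed in by hand
rfam <- data.frame(
  id   = rep(c("RF00001", "RF00002", "RF00003", "RF04051", "RF08595"),
             times = c(5, 4, 5, 4, 2)),
  name = c(1, 1, 1, 1, 3,
           1, 1, 1, 42,
           1, 1, 1, 1, 1,
           1, 12, 2, 15,
           1, 1)
)

# Sum "name" within each "id"; the result has one row per id
totals <- aggregate(name ~ id, data = rfam, FUN = sum)
totals
```

On these rows the sums are 7, 45, 5, 30 and 2, which matches the output requested in the question.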
mtoto
  • `rfam$name <- as.numeric(rfam$name)` doesn't work. Nonetheless, I figured it out by using the strsplit function. I still don't understand what happens after running the gsub function. Thanks for the effort. – sursek Jan 05 '16 at 06:12
  • Thumbs up to you **mtoto** for your updated answer. It works like a charm, no weird data. What still bugs me is that the data appeared fine in R after using `rfam$name<- as.data.frame(sapply(rfam$name, gsub, pattern='^.*?x', replacement=""))`. But when I tried to export it, all these c's and weird numbers appeared. I don't understand why R doesn't show this or give some warning. – sursek Jan 07 '16 at 07:03
  • That's because the data frame nested inside your data frame has only one column, so it prints like an ordinary column; see `str(rfam)` – mtoto Jan 07 '16 at 07:05
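As a rough illustration of that last comment, the sketch below builds a tiny data frame with a one-column data frame nested inside it, checks the column classes, and then flattens the nested column. The names `nested`, `x` and `was_nested` are made up for this example, and the two rows are illustrative.

```r
# A two-row stand-in for rfam with a one-column data.frame stored as "name"
nested <- data.frame(id = c("RF00001", "RF00002"))
nested$name <- data.frame(x = c("1", "42"))

# Printing looks almost normal, but str() and class() reveal the nesting
str(nested)
sapply(nested, class)
was_nested <- is.data.frame(nested$name)

# Flatten: pull out the inner column, then convert it to numeric
# (as.character() first, in case the inner column is a factor)
nested$name <- as.numeric(as.character(nested$name[[1]]))
str(nested)
```

Checking `str()` after any column assignment that involves `as.data.frame()` is a cheap way to catch this kind of nesting before exporting.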