I have a data frame called lbt_all_epitopes of 38282 rows and three columns, as shown below:

         sequence    score epitope.
1 RPGGPPGYRTPYTAK 1.724911  Epitope
2 TQGDRQKIQDAVSAA 1.664611  Epitope
3 EVKSRYNVDVSQNKR 1.593236  Epitope
4 VIEMTRAFEDDDFDK 1.578200  Epitope
5 ITQGDRQKIQDAVSA 1.533208  Epitope
6 GSADLTPSNLTRPAS 1.532700  Epitope

In the first column (named sequence) there are many near-duplicate strings that I want to remove. I find them with agrep(), using a substring extracted with str_sub() as the pattern. For example, taking the first string of lbt_all_epitopes$sequence ("RPGGPPGYRTPYTAK"), I want to find all similar strings in the whole column and store them in a vector or a data.frame called to_be_removed. I want to repeat this for the first 30 elements of lbt_all_epitopes$sequence; for the sake of simplicity, let's just consider the top five rows. When I run a loop like the one below:

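library(stringr)  # provides str_sub()

# Indices of the query rows (only the first 5 here; 1:30 for the real run)
top_30 <- 1:5

for (i in top_30) {
  # Fuzzy-match characters 5 to 11 of sequence i against the whole column
  print(agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11),
              lbt_all_epitopes$sequence, value = TRUE))
}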

The output:

 [1] "RPGGPPGYRTPYTAK" "VGTRPGGPPGYRTPY" "TRPGGPPGYRTPYTA" "GGPPGYRTPYTAKPF" "PGGPPGYRTPYTAKP"
 [6] "LVGTRPGGPPGYRTP" "TLVGTRPGGPPGYRT" "GPPGYRTPYTAKPFV" "PPGYRTPYTAKPFVM" "GTRPGGPPGYRTPYT"
[11] "PGYRTPYTAKPFVMC"
 [1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
 [6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DRQKIQDAVSAASSW" "RQKIQDAVSAASSWL"
[11] "QKIQDAVSAASSWLE"
 [1] "EVKSRYNVDVSQNKR" "VKSRYNVDVSQNKRA" "NEVKSRYNVDVSQNK" "KSRYNVDVSQNKRAR" "LNEVKSRYNVDVSQN"
 [6] "YNVDVSQNKRARLRL" "RYNVDVSQNKRARLR" "MLNEVKSRYNVDVSQ" "SRYNVDVSQNKRARL" "HMLNEVKSRYNVDVS"
[11] "EHMLNEVKSRYNVDV"
 [1] "VIEMTRAFEDDDFDK" "RVIEMTRAFEDDDFD" "GDRVIEMTRAFEDDD" "DRVIEMTRAFEDDDF" "IEMTRAFEDDDFDKF"
 [6] "RGDRVIEMTRAFEDD" "EMTRAFEDDDFDKFD" "FRGDRVIEMTRAFED" "MTRAFEDDDFDKFDR" "TRAFEDDDFDKFDRV"
[11] "RAFEDDDFDKFDRVR"
 [1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
 [6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DVQNGITQGDRQKIQ" "DRQKIQDAVSAASSW"
[11] "RQKIQDAVSAASSWL"

This is exactly what I want, i.e. it printed all the strings (11 per iteration) similar to the first, second, third... fifth elements of lbt_all_epitopes$sequence. However, when I try to store the output in a vector (called to_be_removed) with the following loop:

# create the empty vector where I will store the output
to_be_removed <- c()

for(i in top_30) {
  to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = T)
}

I noticed that each iteration produced only a single string as output (as opposed to 11 strings for each iteration), as below:

> to_be_removed
[1] "RPGGPPGYRTPYTAK" "TQGDRQKIQDAVSAA" "EVKSRYNVDVSQNKR" "VIEMTRAFEDDDFDK" "TQGDRQKIQDAVSAA"

The following warning message was displayed:

Warning messages:
1: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
2: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
3: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
4: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
5: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length

I assume I am missing the code that tells R to concatenate all the strings produced by each iteration before moving on to the next one. Does anyone know how to correctly store the output in a vector, or even in a data.frame?

BCArg
  • I'm pretty sure that you cannot store an object of length > 1 in a single entry of a vector. Why not use a list? Try something like `to_be_removed <- lapply(lbt_all_epitopes$sequence[1:5], function(x) agrep(str_sub(x, start = 5, end = 11), lbt_all_epitopes$sequence, value = T))` – LAP Jan 26 '17 at 09:19
  • By the way, could you provide your dataset in the form of `dput(head(lbt_all_epitopes))`? – LAP Jan 26 '17 at 09:21
  • Thanks, it does the job, just like the adapted loop from the colleague below. Do you know any other way to store the output in a data.frame? In this case, it would be best to have a data frame, so that I can look for the strings in to_be_removed in my original dataset (lbt_all_epitopes) and remove them. Thanks. Yes, next time I will post with dput – BCArg Jan 26 '17 at 10:05
  • Well, do you want a single string in every column of the `data.frame`, or just all strings together in one column? – LAP Jan 26 '17 at 10:09
  • I want to store the output such that I can further look for the strings in my `lbt_all_epitopes`. For example, I tried to exclude what was in the `to_be_excluded` list with `subset <- lbt_all_epitopes[!lbt_all_epitopes$sequence %in% to_be_removed, ]`, but it did not work. – BCArg Jan 26 '17 at 10:24
  • See my answer, I got you a vector :) – LAP Jan 26 '17 at 10:25

2 Answers

You can create a list:

# create the empty list where I will store the output
to_be_removed <- list()

for(i in top_30) {
  to_be_removed[[i]] <- agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = T)
}

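Notice the double brackets used to fill the list: a list element can hold a whole vector, whereas a single element of a character vector can hold only one string.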
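To make the difference concrete, here is a minimal sketch on toy data. The vector `seqs` and the pattern are made up for illustration, and plain `grep()` stands in for `agrep()` so the matches are easy to predict; the assignment behaviour is the same.

```r
# Toy data (hypothetical), standing in for lbt_all_epitopes$sequence
seqs <- c("AAAABBBBCCCC", "AABBBBCCCCDD", "XXXXYYYYZZZZ")

# Two strings contain the pattern, so `hits` has length 2
hits <- grep("BBBB", seqs, value = TRUE)

# `[<-` on a character vector: one slot takes one string, so only hits[1]
# is kept and R warns that the replacement length does not fit
v <- character(0)
v[1] <- hits

# `[[<-` on a list: the slot stores the whole vector
l <- list()
l[[1]] <- hits

length(v)       # one string survived
length(l[[1]])  # both strings are stored
```

The warning raised by `v[1] <- hits` is the same one shown in the question.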

Also, next time please post your data using dput() so we can use it directly. To do so, run dput(lbt_all_epitopes), which returns:

structure(list(X = 1:6, sequence = structure(c(4L, 5L, 1L, 6L, 
3L, 2L), .Label = c("EVKSRYNVDVSQNKR", "GSADLTPSNLTRPAS", "ITQGDRQKIQDAVSA", 
"RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "VIEMTRAFEDDDFDK"), class = "factor"), 
    score = structure(c(6L, 5L, 4L, 3L, 2L, 1L), .Label = c("1.532700", 
    "1.533208", "1.578200", "1.593236", "1.664611", "1.724911"
    ), class = "factor"), epitope. = structure(c(1L, 1L, 1L, 
    1L, 1L, 1L), .Label = "Epitope", class = "factor")), .Names = c("X", 
"sequence", "score", "epitope."), class = "data.frame", row.names = c(NA, 
-6L))
Etienne Kintzler
  • If I use the command you mentioned, I get: class = "factor"), score = c(1.7249113, 1.6646106, 1.5932359, 1.5782, 1.5332078, 1.5326996), epitope. = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Epitope", "Non-Epitope"), class = "factor")), .Names = c("sequence", "score", "epitope."), row.names = c(NA, 6L), class = "data.frame") Is this what you mean? Thanks, the loop does the job, although it would be nicer to store the output in a data.frame. Any idea how to do that? – BCArg Jan 26 '17 at 10:06
  • Thanks for providing the `dput()`, @EtienneKintzler! – LAP Jan 26 '17 at 10:23
  • Yes, `dput()` is really awesome @LeoP. I found it on http://stackoverflow.com/questions/1295955/what-is-the-most-useful-r-trick which you can check; you might learn some interesting functions – Etienne Kintzler Jan 26 '17 at 10:35
  • It's weird I get a different output when I type the very same command as you, or very similar as `dput(head(lbt_all_epitopes))`... – BCArg Jan 26 '17 at 10:47
  • @BCArg you can't store the results directly in a data.frame because the result of each iteration doesn't have the same length. If you know the maximal length of the output across iterations (for instance 6), you can use the following code: `tmp <- lapply(to_be_removed, function(x) {length(x) <- 6; x}); data.frame(tmp)`. The function in lapply() sets the length of every vector in the list to 6, padding with NA where a vector has fewer than 6 elements. – Etienne Kintzler Jan 26 '17 at 10:47
  • @BCArg You didn't copy and paste the output of dput correctly, since it doesn't begin with `structure(..`. Also, since I copy-pasted your data into Excel and then imported it into R, it's possible that the elements within the structure differ; for instance, the field `.Names` in the output of my dput contains the value `X` because read.csv imported the row names (1, 2, ..., 6) as a column (with default name `X`) – Etienne Kintzler Jan 26 '17 at 10:52
  • Oh, now I see, I did the very same thing as you (copy to Excel, then import to R again) and could "restore" the output using the dput output. Thanks a lot! – BCArg Jan 26 '17 at 11:07
  • OK, but the point is that you don't need to copy to Excel and then import to R; you can just use `dput` on whatever R object you want! I was just saying that the output of `dput` you copy-pasted was incomplete (because it doesn't begin with `structure(`) and can be different from mine. – Etienne Kintzler Jan 26 '17 at 14:03
  • Yes, I know it should work like that; however, the `dput()` output directly from R was something bizarre, very long. It only works when I import a data frame from Excel (which was previously saved with `write.csv()` from R). I have no idea why that is. – BCArg Jan 27 '17 at 08:46
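The padding idea from the comments above can be sketched as follows. This is a minimal illustration, not the commenter's exact code: the list contents are a hand-typed stand-in for the loop output, the maximum length is computed with `lengths()` instead of being hard-coded, and the `query_` column names are made up.

```r
# Hypothetical list of matches with unequal lengths
to_be_removed <- list(
  c("RPGGPPGYRTPYTAK"),
  c("TQGDRQKIQDAVSAA", "ITQGDRQKIQDAVSA"),
  c("EVKSRYNVDVSQNKR")
)

# The longest element decides how many rows are needed
n_max <- max(lengths(to_be_removed))

# Setting length(x) beyond its current length pads the vector with NA
padded <- lapply(to_be_removed, function(x) {
  length(x) <- n_max
  x
})

# One column per query string; the column names are illustrative only
names(padded) <- paste0("query_", seq_along(padded))
padded_df <- data.frame(padded, stringsAsFactors = FALSE)
padded_df
```

Each column then holds the matches for one query, with NA filling the shorter columns.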

To avoid growing an object inside a for() loop, we can use lapply(). This can be faster when handling huge datasets.

to_be_removed <- lapply(lbt_all_epitopes$sequence[1:5], function(x) agrep(str_sub(x, start = 5, end = 11), lbt_all_epitopes$sequence, value = T))

gives a list with the extracted strings for each row in a separate list entry:

[[1]]
[1] "RPGGPPGYRTPYTAK"

[[2]]
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA"

[[3]]
[1] "EVKSRYNVDVSQNKR"

[[4]]
[1] "VIEMTRAFEDDDFDK"

[[5]]
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA"

Now you can unlist() those into a single vector (which you could use to subset). The strsplit() step is harmless here, since the strings contain no spaces:

to_be_removed <- unlist(lapply(to_be_removed, function(x) strsplit(x, " ")))

Output:

[1] "RPGGPPGYRTPYTAK" "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "EVKSRYNVDVSQNKR" "VIEMTRAFEDDDFDK" "TQGDRQKIQDAVSAA"
[7] "ITQGDRQKIQDAVSA"
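As a sketch of how such a vector could then be used to drop rows, assuming a small stand-in for `lbt_all_epitopes` built from the six sequences in the question (the `matches` list and the name `kept` are hypothetical):

```r
# Small stand-in for lbt_all_epitopes (values copied from the question)
lbt_all_epitopes <- data.frame(
  sequence = c("RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "EVKSRYNVDVSQNKR",
               "VIEMTRAFEDDDFDK", "ITQGDRQKIQDAVSA", "GSADLTPSNLTRPAS"),
  score = c(1.724911, 1.664611, 1.593236, 1.578200, 1.533208, 1.532700),
  stringsAsFactors = FALSE
)

# Hypothetical per-query matches, shaped like the lapply() result
matches <- list(
  "RPGGPPGYRTPYTAK",
  c("TQGDRQKIQDAVSAA", "ITQGDRQKIQDAVSA")
)

# Flatten the list and drop repeated strings
to_be_removed <- unique(unlist(matches))

# Keep only the rows whose sequence is not in to_be_removed
kept <- lbt_all_epitopes[!lbt_all_epitopes$sequence %in% to_be_removed, ]
kept
```

`%in%` needs a plain vector on its right-hand side, which is why the list is flattened first; `unique()` is optional but keeps the vector short when several queries return the same strings.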
LAP
  • Excellent, this is exactly what I want! I also tried the `dput` command from etiennekintzler (`dput(lbt_all_epitopes)`) and I got something completely different. Do you know why? – BCArg Jan 26 '17 at 10:37
  • Glad to help! `dput()` gives you an output for your whole dataframe, which usually is quite big and therefore the code is pretty long. For an example for SO, use either `dput(head(yourdata))` or - if that is insufficient - manually limit the dimensions: `dput(yourdata[1:20, 1:5])`. – LAP Jan 26 '17 at 10:40
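One possible reason for a very long `dput()` output even after `head()`, offered as an assumption since the full data is not shown: a factor column keeps all of its levels when rows are subset, so every level is still written out. A minimal sketch with a made-up data frame `df`, using `droplevels()` to trim the unused levels:

```r
# Hypothetical data frame with a factor column of 100 distinct levels
df <- data.frame(
  id = 1:100,
  sequence = factor(paste0("SEQ", 1:100))
)

# head() keeps 3 rows, but the factor still carries all 100 levels
nlevels(head(df, 3)$sequence)

# droplevels() removes the levels that no longer occur
nlevels(droplevels(head(df, 3))$sequence)

# Compare the size of the two dput() outputs
full_levels <- capture.output(dput(head(df, 3)))
trimmed     <- capture.output(dput(droplevels(head(df, 3))))
sum(nchar(full_levels)) > sum(nchar(trimmed))
```

If the columns are character rather than factor, this does not apply and `dput(head(yourdata))` is already compact.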