I have a data frame called lbt_all_epitopes with 38282 rows and three columns, as shown below:
sequence score epitope.
1 RPGGPPGYRTPYTAK 1.724911 Epitope
2 TQGDRQKIQDAVSAA 1.664611 Epitope
3 EVKSRYNVDVSQNKR 1.593236 Epitope
4 VIEMTRAFEDDDFDK 1.578200 Epitope
5 ITQGDRQKIQDAVSA 1.533208 Epitope
6 GSADLTPSNLTRPAS 1.532700 Epitope
In the first column (named sequence) I have many similar strings, which I want to remove (I find them by extracting a substring with str_sub and fuzzy-matching it with agrep). For example, taking the first string of lbt_all_epitopes$sequence ("RPGGPPGYRTPYTAK"), I want to find all similar strings in the whole column and store them in a vector or a data.frame called to_be_removed. I want to repeat this for the first 30 elements of lbt_all_epitopes$sequence. For the sake of simplicity, let's consider only the top five rows. When I run a loop like the one below:
library(stringr)  # provides str_sub()

# Iterate over the first 5 rows only (the full run would use 1:30)
top_30 <- 1:5
for (i in top_30) {
  print(agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = TRUE))
}
The output:
[1] "RPGGPPGYRTPYTAK" "VGTRPGGPPGYRTPY" "TRPGGPPGYRTPYTA" "GGPPGYRTPYTAKPF" "PGGPPGYRTPYTAKP"
[6] "LVGTRPGGPPGYRTP" "TLVGTRPGGPPGYRT" "GPPGYRTPYTAKPFV" "PPGYRTPYTAKPFVM" "GTRPGGPPGYRTPYT"
[11] "PGYRTPYTAKPFVMC"
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
[6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DRQKIQDAVSAASSW" "RQKIQDAVSAASSWL"
[11] "QKIQDAVSAASSWLE"
[1] "EVKSRYNVDVSQNKR" "VKSRYNVDVSQNKRA" "NEVKSRYNVDVSQNK" "KSRYNVDVSQNKRAR" "LNEVKSRYNVDVSQN"
[6] "YNVDVSQNKRARLRL" "RYNVDVSQNKRARLR" "MLNEVKSRYNVDVSQ" "SRYNVDVSQNKRARL" "HMLNEVKSRYNVDVS"
[11] "EHMLNEVKSRYNVDV"
[1] "VIEMTRAFEDDDFDK" "RVIEMTRAFEDDDFD" "GDRVIEMTRAFEDDD" "DRVIEMTRAFEDDDF" "IEMTRAFEDDDFDKF"
[6] "RGDRVIEMTRAFEDD" "EMTRAFEDDDFDKFD" "FRGDRVIEMTRAFED" "MTRAFEDDDFDKFDR" "TRAFEDDDFDKFDRV"
[11] "RAFEDDDFDKFDRVR"
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
[6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DVQNGITQGDRQKIQ" "DRQKIQDAVSAASSW"
[11] "RQKIQDAVSAASSWL"
This is exactly what I want, i.e. it printed all the strings (11 per iteration) similar to the first, second, third, ... fifth elements of lbt_all_epitopes$sequence.
However, when I try to store the output in a vector (called to_be_removed) with the following loop:
# create the empty vector where I will store the output
to_be_removed <- c()
for(i in top_30) {
to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = T)
}
I noticed that each iteration produced only a single string as output (as opposed to 11 strings for each iteration), as below:
> to_be_removed
[1] "RPGGPPGYRTPYTAK" "TQGDRQKIQDAVSAA" "EVKSRYNVDVSQNKR" "VIEMTRAFEDDDFDK" "TQGDRQKIQDAVSAA"
The following warning messages were displayed, one per iteration:
Warning messages:
1: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], :
number of items to replace is not a multiple of replacement length
2: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], :
number of items to replace is not a multiple of replacement length
3: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], :
number of items to replace is not a multiple of replacement length
4: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], :
number of items to replace is not a multiple of replacement length
5: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], :
number of items to replace is not a multiple of replacement length
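The behaviour above can be reproduced in isolation. Here is a minimal sketch, using a made-up three-element vector (hits, a hypothetical stand-in for one agrep() result) rather than the real data: assigning several values into the single slot out[1] raises the same warning and keeps only the first value.

```r
# Made-up data standing in for one agrep() result (not from lbt_all_epitopes)
hits <- c("AAA", "AAB", "AAC")

out <- c()
# A single slot cannot hold three values, so R warns
# ("number of items to replace is not a multiple of replacement length")
# and stores only the first element of hits.
out[1] <- hits

print(out)
```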
I therefore assume that I am missing code that tells R to concatenate all the strings produced by each iteration before moving on to the next one.
Does anyone know how to correctly store the output in a vector, or even in a data.frame?
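One possible approach, sketched under assumptions: collect each iteration's matches as one element of a list, then flatten the list afterwards. The toy data frame, top_n, matches and kept below are hypothetical names for illustration only, and base substr() is swapped in for str_sub() so the sketch runs without extra packages (for positive start and end positions the two should extract the same substring).

```r
# Hypothetical toy stand-in for lbt_all_epitopes: three overlapping
# peptides plus one unrelated peptide.
toy <- data.frame(
  sequence = c("RPGGPPGYRTPYTAK", "PGGPPGYRTPYTAKP",
               "GGPPGYRTPYTAKPF", "VIEMTRAFEDDDFDK"),
  stringsAsFactors = FALSE
)

top_n <- 1:2  # would be 1:30 for the real data

# One list element per query; each element holds every match for that query,
# so results of any length can be stored.
matches <- lapply(top_n, function(i) {
  core <- substr(toy$sequence[i], 5, 11)   # base-R equivalent of str_sub(..., 5, 11)
  agrep(core, toy$sequence, value = TRUE)  # fuzzy matches, default max.distance
})

# Flatten into a single character vector and drop repeated hits
to_be_removed <- unique(unlist(matches))

# Optional: keep only the rows whose sequence was not flagged
kept <- toy[!toy$sequence %in% to_be_removed, , drop = FALSE]
```

Inside a for loop the same idea would be to_be_removed <- c(to_be_removed, agrep(...)), which appends every match rather than writing into one slot; the list version keeps the per-query grouping available if it is needed later.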