How can I add a column based on some condition quickly in R?

Question

I have a data frame with a single column, and I want to create a second column based on a condition on the first. Here is the script I have written so far. It works, but it is very slow because the data has around 510k rows.

 data <- read.table("~/Documents/git_repos/Aspen/Reference_genome/Potrs01-genome_mod_id.txt")
> dim(data) # [1] 509744      1
> head(data)
           V1
1 Potrs000004
2 Potrs000004
3 Potrs000004
4 Potrs000004
5 Potrs000004
6 Potrs000004

test <- paste("Potrs00000", seq(000001,10000,by=1), sep ="")
length(test) # [1] 10000
> head(test)
[1] "Potrs000001" "Potrs000002" "Potrs000003" "Potrs000004" "Potrs000005"
[6] "Potrs000006"

test.m <- matrix("NA", nrow = 509744, ncol = 2 )
dim(test.m) # [1] 509744      2
> head(test.m)
     [,1] [,2]
[1,] "NA" "NA"
[2,] "NA" "NA"
[3,] "NA" "NA"
[4,] "NA" "NA"
[5,] "NA" "NA"
[6,] "NA" "NA"

 for (i in test) {
   for (j in data$V1) {
     if (i == j)
       test.m[,1] = j
       test.m[,2] = "chr9"
      }
    }
test.d <- as.data.frame(test.m)
> head(test.d)
           V1   V2
1 Potrs000004 chr9
2 Potrs000004 chr9
3 Potrs000004 chr9
4 Potrs000004 chr9
5 Potrs000004 chr9
6 Potrs000004 chr9

Is there a way to modify the code to speed it up?

Pro tip: you don't need to specify `"NA"` in `matrix`, it's the default so you can just write `matrix(nrow=...,ncol=2)` – MichaelChirico, Aug 28 '15 at 23:25
@VeerendraGadekar, I have added the sample data and the desired output – upendra, Aug 28 '15 at 23:26
@upendra you could try `library(data.table); setDT(data)[V1 %in% test, V2 := "chr9"]` – Veerendra Gadekar, Aug 28 '15 at 23:27
@VeerendraGadekar not exactly, since `data` will still have unmatched values in it, though yours can be easily adjusted, see below. – MichaelChirico, Aug 28 '15 at 23:31
@MichaelChirico what do you mean here by unmatched values? Could you give an example? As far as I can see, this works fine. – Veerendra Gadekar, Aug 28 '15 at 23:32
See my sample `data`. The code will only work to your specification _if_ every value of `V1` is in `test` somewhere. Any value of `V1` not in `test` will remain, but have `V2==NA`. – MichaelChirico, Aug 28 '15 at 23:34

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

It seems like you want the values of `V1` in `data` that match an element of `test`.

I would do this with data.table:

library(data.table)
setDT(data)
data[,.(V1[V1 %in% test], "chr9")]

Note that the result is already a data.table (which is also a data.frame).

Sample Data

set.seed(10239)
data<-data.frame(V1=sample(c(test[1:10],LETTERS[1:10]),10))
> data
            V1
1            D
2            A
3            E
4  Potrs000006
5  Potrs000001
6  Potrs000007
7  Potrs000008
8  Potrs000003
9            B
10 Potrs000002
setDT(data)
> data[,.(V1[V1 %in% test], "chr9")]
            V1   V2
1: Potrs000006 chr9
2: Potrs000001 chr9
3: Potrs000007 chr9
4: Potrs000008 chr9
5: Potrs000003 chr9
6: Potrs000002 chr9
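
For readers who prefer not to load data.table, the same filter can be sketched in base R with a single vectorized `%in%` test, which replaces both loops from the question. This is only a sketch and not part of the original answer: the IDs below are synthetic, built with `sprintf` on the assumption that real IDs are zero-padded to six digits, and `ids`, `keep` and `result` are hypothetical names.

```r
# Synthetic stand-in for the real file (IDs and sizes are illustrative only).
set.seed(1)
ids  <- sprintf("Potrs%06d", 1:20000)   # assumed ID format: six zero-padded digits
test <- ids[1:10000]                    # the IDs that should be labelled
data <- data.frame(V1 = sample(ids, 509744, replace = TRUE),
                   stringsAsFactors = FALSE)

# One vectorized membership test instead of two nested loops.
keep <- data$V1 %in% test

# Keep only the matching rows, then attach the constant label.
# rep() keeps this valid even if no rows match.
result <- data[keep, , drop = FALSE]
result$V2 <- rep("chr9", nrow(result))

head(result)
```

Because `%in%` checks the whole column at once, there is no need to compare every element of `test` against every row by hand.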

answered Aug 28 '15 at 23:21

MichaelChirico

Note that the first row of the result of `data[V1 %in% test,V2:="chr9"]` is `1: D NA` – MichaelChirico Aug 28 '15 at 23:36
This worked fine too, but I had to write it to a different object. `data2 <- data[,.(V1[V1%in%test],"chr9")]` – upendra Aug 28 '15 at 23:36
@upendra yes, see comment above. If you're fine having `NA` in your data the other approach is fine. If all of `data$V1` is in `test` somewhere, then your problem is trivial, and you should instead just write `data$V2<-"chr9"` and be done with it. – MichaelChirico Aug 28 '15 at 23:37
@MichaelChirico to me, using `:=` looks like the standard approach. And the unmatched values will be `NA`s, which can be easily removed in the next step – Veerendra Gadekar Aug 28 '15 at 23:39
@VeerendraGadekar they can't (without a lot of effort) be removed without making a copy (see [here](http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-r-data-table)), so why not just remove them in the first step like I did? – MichaelChirico Aug 28 '15 at 23:39
@VeerendraGadekar that will return an object without `NA` values, but you'll need to assign it in order for it to be stored in memory, i.e., make a copy. Same for `data[V1 %in% test, V2:="chr9"][!is.na(V2)]`. – MichaelChirico Aug 28 '15 at 23:41
Thank you both for the solution and the valuable information – upendra Aug 28 '15 at 23:50
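
The trade-off discussed in these comments can be seen side by side in the sketch below. It assumes the data.table package is installed. The small `data` and `test` objects are made up for illustration, and `by_ref` and `filtered` are hypothetical names.

```r
library(data.table)

# Tiny made-up inputs, for illustration only.
test <- c("Potrs000001", "Potrs000002", "Potrs000003")
data <- data.table(V1 = c("A", "Potrs000001", "B", "Potrs000003"))

# Approach 1: add V2 by reference with `:=`.
# Rows whose V1 is not in `test` stay in the table, with V2 = NA.
by_ref <- copy(data)   # copied only so both approaches start from the same input
by_ref[V1 %in% test, V2 := "chr9"]
print(by_ref)

# Approach 2: build a new table that contains only the matching rows.
filtered <- data[, .(V1 = V1[V1 %in% test], V2 = "chr9")]
print(filtered)

# Dropping the NA rows from approach 1 also produces a new object,
# so it has to be assigned to be kept.
cleaned <- by_ref[!is.na(V2)]
print(cleaned)
```

Either route ends with the same matching rows. The difference is whether the unmatched rows are filtered out at the start or in a second step.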
