
I am trying to work out how I would combine an ifelse statement with the shift function in data.table. My data looks like this:

DF <- structure(list(CHR = c(1, 1, 1, 1, 1,1), 
SNP = c("rs2494631", "rs4648637", "rs2494627", "rs11122119", "rs1844583","rs2292242"), 
BP = c(2399149, 2401364, 2402499, 6768856, 8383469, 8385059), 
KBdist= c(NA, 2215, 1135, 4366357, 1614613, 1590), 
locus = c(1, NA, NA, NA, NA, NA)), 
.Names = c("CHR","SNP","BP","KBdist","locus"), 
row.names = c(NA, 6L), 
class = "data.frame")

> DF

CHR SNP        BP       KBdist   locus
1   rs2494631  2399149  NA       1
1   rs4648637  2401364  2215     NA
1   rs2494627  2402499  1135     NA
1   rs11122119 6768856  4366357  NA
1   rs1844583  8383469  1614613  NA
1   rs2292242  8385059  1590     NA

and what I am trying to achieve is: "If CHR is equal to the line above, and KBdist is less than 500,000, make locus equal to the line above, else add one to the value of the line above". Which would yield an output that looks like this:

CHR SNP        BP       KBdist   locus
1   rs2494631  2399149  NA       1
1   rs4648637  2401364  2215     1
1   rs2494627  2402499  1135     1
1   rs11122119 6768856  4366357  2
1   rs1844583  8383469  1614613  3
1   rs2292242  8385059  1590     3

I know that I can use shift to access the values in the row above, for example:

setDT(DF)[ , KBdist := BP - shift(BP, 1L, type="lag")]

That is how I created the KBdist column, but I don't see how to extend it to include the ifelse conditions above.

Any help would be greatly appreciated.

Thanks in advance.

Lynsey
  • Could you `dput` your sample data? Also, to be clear, `else add one` means adding one to the current value (i.e. `NA`)? – niko Feb 01 '19 at 19:29
  • Hopefully addressed both parts of the comment, thanks for pointing out re: dput and else! – Lynsey Feb 01 '19 at 19:40
  • Perfect. One last question: are you looking *specifically* for a solution using `data.table::shift` or for general solutions for the task? – niko Feb 01 '19 at 19:47
  • Not specifically! I just thought it was a logical starting point as it was how I had accessed the previous row when generating other columns. – Lynsey Feb 01 '19 at 19:56

2 Answers

Here is a solution in base R; data.table is not needed.

# logical vector with our condition tested
ind <- (diff(DF$CHR) == 0 & DF$KBdist[-1] < 5e+5)
# populating the 'locus' column   ---   notice the '<<-'
vapply(2:nrow(DF), function (k) DF$locus[k] <<- DF$locus[k-1] + 1 - ind[k-1], numeric(1)) 
# [1] 1 1 2 3 3
DF
#   CHR        SNP      BP  KBdist locus
# 1   1  rs2494631 2399149      NA     1
# 2   1  rs4648637 2401364    2215     1
# 3   1  rs2494627 2402499    1135     1
# 4   1 rs11122119 6768856 4366357     2
# 5   1  rs1844583 8383469 1614613     3
# 6   1  rs2292242 8385059    1590     3

vapply(...) returns the new locus values for rows 2 to 6 and, as a side effect of <<-, writes them into DF$locus.

Remark

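Note that I used <<- inside the function so that each DF$locus[k] is written back to DF; the next step then reads that value as DF$locus[k-1]. Because of this dependence, simply swapping <<- for <- and assigning DF$locus[-1] <- vapply(...) does not work: the local assignment never reaches DF, so every row after the second comes out as NA. If you don't like <<-, a vectorised alternative that gives the same result is DF$locus <- DF$locus[1] + c(0, cumsum(!ind)).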
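For readers who have not met function (k) before, the same recurrence can also be written as a plain for loop. This is only a sketch: it rebuilds just the three columns it needs from the question's sample data, and it swaps the vapply call for a loop rather than reproducing the code above exactly.

```r
# Only the columns used by the rule, copied from the question's sample data
DF <- data.frame(
  CHR    = c(1, 1, 1, 1, 1, 1),
  KBdist = c(NA, 2215, 1135, 4366357, 1614613, 1590),
  locus  = c(1, NA, NA, NA, NA, NA)
)

# TRUE where a row stays in the same locus as the row above it
ind <- diff(DF$CHR) == 0 & DF$KBdist[-1] < 5e5

# Walk down the rows; each locus value depends on the one just written
for (k in seq_len(nrow(DF))[-1]) {
  # ind is TRUE (1) -> keep the previous locus; FALSE (0) -> previous locus + 1
  DF$locus[k] <- DF$locus[k - 1] + 1 - ind[k - 1]
}

DF$locus
```

Because the loop assigns into DF in the calling environment directly, no <<- is needed.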

niko
  • This works a treat! I am just picking through it and processing what it is doing, as I've not come across function (k) before. This is super clever! Thank you for making my Friday night better :) – Lynsey Feb 01 '19 at 20:07
  • @Lynsey please [accept as answer](https://stackoverflow.com/help/someone-answers) if this is the solution, so we have a closure for your question. – zx8754 Feb 04 '19 at 13:22

Another possibility is using cumsum:

setDT(DF)[, locus := cumsum(c(1L, (CHR!=shift(CHR,1L) | KBdist>=500e3)[-1L]))]

output:

   CHR        SNP      BP  KBdist locus
1:   1  rs2494631 2399149      NA     1
2:   1  rs4648637 2401364    2215     1
3:   1  rs2494627 2402499    1135     1
4:   1 rs11122119 6768856 4366357     2
5:   1  rs1844583 8383469 1614613     3
6:   1  rs2292242 8385059    1590     3
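To see why cumsum works here, it may help to split the one-liner into its intermediate steps. The sketch below uses plain base R vectors copied from the question's sample data so that it runs without data.table; the names prev_chr and new_locus are made up for illustration and do not appear in the original code.

```r
# Columns copied from the question's sample data
CHR    <- c(1, 1, 1, 1, 1, 1)
KBdist <- c(NA, 2215, 1135, 4366357, 1614613, 1590)

# CHR of the row above (NA for the first row), mirroring shift(CHR, 1L)
prev_chr <- c(NA, head(CHR, -1))

# TRUE where a row starts a new locus: chromosome changed or gap >= 500,000.
# The first element is NA (no row above), so it is dropped with [-1L].
new_locus <- (CHR != prev_chr | KBdist >= 500e3)[-1L]

# The first row is locus 1; every TRUE afterwards bumps the running count by one
locus <- cumsum(c(1L, new_locus))
locus
```

The running sum only ever increases when a row fails the "same chromosome and closer than 500,000" test, which is exactly the rule stated in the question.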
chinsoon12