1

I am trying to work out how I can generate new columns based on the first and last instance of a column value. My data looks like this:

DF <- structure(list(CHR = c(1, 1, 1, 1, 1, 1), 
SNP = c("rs2494631", "rs4648637", "rs2494627", "rs11122119", "rs1844583","rs2292242"), 
BP = c(2399149, 2401364, 2402499, 6768856, 8383469, 8385059), 
KBdist= c(NA, 2215, 1135, 4366357, 1614613, 1590), 
locus = c(1, 1, 1, 2, 3, 3)), 
.Names = c("CHR","SNP","BP","KBdist","locus"), 
row.names = c(NA, 6L), 
class = "data.frame")

> df

CHR SNP        BP       KBdist   locus
1   rs2494631  2399149  NA       1
1   rs4648637  2401364  2215     1
1   rs2494627  2402499  1135     1
1   rs11122119 6768856  4366357  2
1   rs1844583  8383469  1614613  3
1   rs2292242  8385059  1590     3

and what I am trying to achieve is: "If locus is the same, make start the same as BP in the first instance of that locus, and make stop the same as BP in the last instance of that locus". Which would yield an output that looks like this:

CHR SNP        BP       KBdist   locus    start    stop
1   rs2494631  2399149  NA       1        2399149  2402499
1   rs4648637  2401364  2215     1        2399149  2402499
1   rs2494627  2402499  1135     1        2399149  2402499
1   rs11122119 6768856  4366357  2        6768856  6768856
1   rs1844583  8383469  1614613  3        8383469  8385059
1   rs2292242  8385059  1590     3        8383469  8385059

I have been playing around with the answer to a similar question I posed: Combining an ifelse statement with shift data.table function in R

and with the shift function for data.table in R, but to no avail. Any help would be greatly appreciated!

Thanks.

Lynsey
  • 339
  • 1
  • 2
  • 11
  • 2
    with `dplyr`, you can do `DF %>% group_by(locus) %>% mutate(start = first(BP), stop = last(BP)) ` – Ronak Shah Feb 04 '19 at 13:06
  • @RonakShah just saw your comment... – patL Feb 04 '19 at 13:09
  • 1
    @patL no issues. :-) This happens all the time on SO.. – Ronak Shah Feb 04 '19 at 13:10
  • Thank you both for your help. My apologies, I didn't realise this was a duplicate. I did do a good Google/SO trawl before posting and didn't find anything that answered my Q. Whereas now I have asked this Q, suggested posts in the sidebar include: https://stackoverflow.com/questions/42654219/get-first-and-last-values-in-group-dplyr-group-by-with-last-and-first?noredirect=1&lq=1 – Lynsey Feb 04 '19 at 13:38
  • @Lynsey just to clarify: Do you want the first and last instance of BP by locus or actually the min max by locus? Proposed answer works in this case, but when you change the row order you would get a different result. – mmn Feb 04 '19 at 13:51

1 Answers1

1

You can use dplyr to accomplish it:

library(dplyr)

dat %>% 
  group_by(locus) %>% 
  mutate(start = first(BP),
         stop = last(BP))

which gives:

## A tibble: 6 x 7
## Groups:   locus [3]
#    CHR SNP             BP  KBdist locus   start    stop
#  <int> <fct>        <int>   <int> <int>   <int>   <int>
#1     1 rs2494631  2399149      NA     1 2399149 2402499
#2     1 rs4648637  2401364    2215     1 2399149 2402499
#3     1 rs2494627  2402499    1135     1 2399149 2402499
#4     1 rs11122119 6768856 4366357     2 6768856 6768856
#5     1 rs1844583  8383469 1614613     3 8383469 8385059
#6     1 rs2292242  8385059    1590     3 8383469 8385059

the data:

dat <- read.table(header = TRUE, 
                  text = "
                  CHR SNP        BP       KBdist   locus
1   rs2494631  2399149  NA       1
1   rs4648637  2401364  2215     1
1   rs2494627  2402499  1135     1
1   rs11122119 6768856  4366357  2
1   rs1844583  8383469  1614613  3
1   rs2292242  8385059  1590     3")
patL
  • 2,259
  • 1
  • 17
  • 38