This is my first time here, so my apologies if I have done something wrong or confusing. I am working with genomic data, and I have two data frames. The first contains the ancestry assigned to each range of SNP positions (see table below):

Chrom | Start    | End      | Ancestry
---------------------------------------
 22   | 16495833 | 19868218 | EUR_Patag
 22   | 19873357 | 21405110 | Patag_Patag
 22   | 21416404 | 21449724 | Patag_UNK
 22   | 21458082 | 23704421 | EUR_Patag
 22   | 23712647 | 23717466 | Patag_UNK

The other data frame contains the phased genotype for each rsID (see table below):

Chrom | Pos      | ID       | Genot
---------------------------------------
 22   | 16495833 | rs116823 | 0|1
 22   | 16620701 | rs635455 | 0|0
 22   | 16648658 | rs445724 | 1|1
 22   | 16872459 | rs827345 | 1|0
 22   | 16880098 | rs309287 | 1|1

For each SNP in the second data frame, I want to use its "Pos" value to find the range in the first data frame that contains it, and then add a new column to the second data frame holding the Ancestry of that range.

While searching for a solution, I found that the data.table library in R should be able to handle this, but unfortunately I could not work out how to do it.

I hope my question is clear. Thank you very much for your help.

  • Could you have a look at https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 and provide your data in a reproducible format? I think data.table does work, something like `DT2[, a := DT1[DT2, on=.(something), Ancestry]]` – Frank Jul 28 '17 at 20:35

1 Answer

I think we can do this as an overlap join. An overlap join joins over an interval, and in this case we have an interval and a position. A position is technically an interval of length 0, so we can still do it.

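library(data.table)

# Convert both data frames to data.tables by reference
setDT(df1)
setDT(df2)

# Duplicate the position so each SNP has an "interval" of length 0
df2[, Pos2 := Pos]

# foverlaps() needs df1 keyed, with the interval columns last in the key
setkey(df1, Start, End)

# For each SNP, find the ancestry range that contains it
outDF <- foverlaps(df2, df1, by.x = c("Pos", "Pos2"), by.y = c("Start", "End"),
  type = "within", mult = "all")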

Output (I added another row, 22 23712650 rs309884 1|1 23712650, to df2 to further validate):

   Chrom    Start      End  Ancestry V1      Pos       ID Genot    Pos2
1:    22 16495833 19868218 EUR_Patag 22 16495833 rs116823 0|1  16495833
2:    22 16495833 19868218 EUR_Patag 22 16620701 rs635455 0|0  16620701
3:    22 16495833 19868218 EUR_Patag 22 16648658 rs445724 1|1  16648658
4:    22 16495833 19868218 EUR_Patag 22 16872459 rs827345 1|0  16872459
5:    22 16495833 19868218 EUR_Patag 22 16880098 rs309287 1|1  16880098
6:    22 23712647 23717466 Patag_UNK 22 23712650 rs309884 1|1  23712650

df1 and df2 are your two data frames, in the order you listed them.
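
For anyone who wants to run this end to end, here is a minimal sketch that types in the tables from the question and also puts Chrom in the key, so that positions are only matched within the same chromosome. Including Chrom in the join is my assumption about what you would want once more than one chromosome is present; the names res and Pos2 are just illustrative.

```r
library(data.table)

# Ancestry ranges and phased genotypes, copied from the tables in the question
df1 <- data.table(
  Chrom    = 22L,
  Start    = c(16495833L, 19873357L, 21416404L, 21458082L, 23712647L),
  End      = c(19868218L, 21405110L, 21449724L, 23704421L, 23717466L),
  Ancestry = c("EUR_Patag", "Patag_Patag", "Patag_UNK", "EUR_Patag", "Patag_UNK")
)
df2 <- data.table(
  Chrom = 22L,
  Pos   = c(16495833L, 16620701L, 16648658L, 16872459L, 16880098L),
  ID    = c("rs116823", "rs635455", "rs445724", "rs827345", "rs309287"),
  Genot = c("0|1", "0|0", "1|1", "1|0", "1|1")
)

# Zero-length interval for each SNP
df2[, Pos2 := Pos]

# Assumption: key on Chrom as well; the interval columns must come last
setkey(df1, Chrom, Start, End)

# by.y defaults to key(df1); mult = "first" keeps one row per SNP
res <- foverlaps(df2, df1,
                 by.x = c("Chrom", "Pos", "Pos2"),
                 type = "within", mult = "first")

# Keep the original genotype columns plus the matched ancestry
res <- res[, .(Chrom, Pos, ID, Genot, Ancestry)]
print(res)
```

SNPs that fall outside every range should come back with NA in Ancestry, since foverlaps keeps unmatched rows by default.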

Example I used for reference here
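
As an alternative along the lines of the comment under the question, a non-equi update join should also work and adds the Ancestry column to df2 in place, without the helper Pos2 column. This is a sketch, assuming a data.table version that supports non-equi joins in on= (1.9.8 or later); the data is typed in from the question plus the extra validation row mentioned above.

```r
library(data.table)

df1 <- data.table(
  Chrom    = 22L,
  Start    = c(16495833L, 19873357L, 21416404L, 21458082L, 23712647L),
  End      = c(19868218L, 21405110L, 21449724L, 23704421L, 23717466L),
  Ancestry = c("EUR_Patag", "Patag_Patag", "Patag_UNK", "EUR_Patag", "Patag_UNK")
)
# Last row is the extra validation SNP (rs309884) from the answer's output
df2 <- data.table(
  Chrom = 22L,
  Pos   = c(16495833L, 16620701L, 16648658L, 16872459L, 16880098L, 23712650L),
  ID    = c("rs116823", "rs635455", "rs445724", "rs827345", "rs309287", "rs309884"),
  Genot = c("0|1", "0|0", "1|1", "1|0", "1|1", "1|1")
)

# For each range in df1, find the SNPs with Start <= Pos <= End on the same
# chromosome and write that range's Ancestry into df2 by reference
df2[df1, on = .(Chrom, Pos >= Start, Pos <= End), Ancestry := i.Ancestry]

print(df2)
```

Here df2 keeps its original rows and order, and any SNP not covered by a range is left as NA.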

Mako212
  • Hi, thank you for your answer. I applied the commands, but an error appears: "Error in foverlaps(data1, rfmix, by.x = names(data1), type = "within", : length(by.x) != length(by.y). Columns specified in by.x should correspond to columns specified in by.y and should be of same lengths". I forgot to mention that the data frames do not have the same length: df1 is shorter than df2. – Patricio Pezo Valderrama Jul 31 '17 at 17:00
  • @PatricioPezoValderrama I made a couple edits to my solution, see if that helps. The data frames don't have to be the same length, I made a couple of mistakes in my original solution. – Mako212 Jul 31 '17 at 17:21
  • It works perfect, thank you very much for your answer. – Patricio Pezo Valderrama Aug 14 '17 at 20:59