Thank you in advance for the help.

I am trying to recode a genetic database that contains genotypes in VCF format. For context, each VCF value looks like this: '0|0:0,0:0:1,0,0'. The part I am interested in is the first two characters (three if you count the |). If these are 0|0, it means that the person has two dominant alleles. If these are 1|1, two recessive alleles. 1|0 and 0|1 are a mix of the two.

I am working on a data frame called "gg" that contains approx 120 columns (one for each SNP) and 1500 rows (one for each subject in the study).

I am trying to recode the SNP from its current format to a more easily analysable format:

  • 0|0 = two dominant alleles - recode as 0
  • 0|1 or 1|0 = mix of one dominant one recessive - recode as 1
  • 1|1 = two recessive alleles - recode as 2

I have attempted several approaches. The latest thing I have attempted has got close-ish. I tried the following:

gg[grep("0|0", gg)] <- "0"

Weirdly this makes all the values for the WHOLE database 0's. I think this is because it is interpreting the 0|0 as 'if the value contains a zero or a zero, recode as zero' (and all values contain at least one zero).

What I want to convey is: recode as 0 if the value starts with the EXACT characters 0|0, recode as 1 if it starts with the EXACT characters 0|1 or 1|0, and recode as 2 if it starts with the EXACT characters 1|1.

  • The regex `"0|0"` does mean "0 or 0". If you want to match that exact string, you can either escape the `|`: `"0\\|0"`, or add `fixed = TRUE` to `grep()` to use exact matching (not regex). But a more robust solution would be to first [tidy](https://r4ds.had.co.nz/tidy-data.html) your data so that 1 row = 1 SNP, then split the VCF-coded column into individual observations that you can recode more precisely. – Joe Roe Feb 08 '21 at 14:50
  • You need to specify the columns. For example, if you wanna replace the first column: `gg[gg == "0|0", 1] <- "0"`. This is a quick and easy fix for you. – StupidWolf Feb 08 '21 at 15:28
  • For a more comprehensive solution, can you please provide an example of the data.frame or matrix you have? You can do `dput(head(df))` and paste the output. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – StupidWolf Feb 08 '21 at 15:31
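
To make the regex point from the comments concrete, here is a minimal sketch comparing the unescaped pattern with the escaped and literal alternatives. The vector `x` is made-up sample data, not the asker's, and `startsWith()` is an extra option that was not mentioned in the comments.

```r
# Hypothetical sample values (not taken from the asker's data)
x <- c("0|0:0,0:0:1,0,0", "0|1:0,0:0:1,0,0", "1|1:0,0:0:1,0,0")

# Unescaped: "|" is regex alternation, so this matches anything containing a "0"
unescaped <- grepl("0|0", x)

# Escaped "|" and anchored with "^" so only the leading genotype is tested
escaped <- grepl("^0\\|0", x)

# fixed = TRUE treats the pattern as a literal string (anchors are not available)
literal <- grepl("0|0", x, fixed = TRUE)

# startsWith() is a literal test that is anchored to the start of the string
prefix <- startsWith(x, "0|0")

print(rbind(unescaped, escaped, literal, prefix))
```

Every sample value contains a "0" somewhere after the genotype, which is why the unescaped pattern matches all of them. Note that `fixed = TRUE` alone still matches "0|0" anywhere in the string, so the anchored regex or `startsWith()` is closer to the "starts with the EXACT characters" requirement.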

2 Answers

Try the code below

colSums(list2DF(strsplit(substr(gsub("\\|","",gg),1,2),""))=="1")

which gives

0 1 1 2

Dummy Data

gg <- c('0|0:0,0:0:1,0,0','10:0,0:0:1,0,0','0|1:0,0:0:1,0,0','11:0,0:0:1,0,0')
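
For readers who find the one-liner dense, the same steps can be written out one at a time. This is only a sketch of how the expression above decomposes; the intermediate variable names are made up for illustration, and `list2DF()` needs R 4.0.0 or later.

```r
gg <- c('0|0:0,0:0:1,0,0','10:0,0:0:1,0,0','0|1:0,0:0:1,0,0','11:0,0:0:1,0,0')

no_pipe       <- gsub("\\|", "", gg)      # drop every "|" from each value
alleles       <- substr(no_pipe, 1, 2)    # keep the two allele characters: "00" "10" "01" "11"
split_alleles <- strsplit(alleles, "")    # list with one two-element character vector per genotype
allele_df     <- list2DF(split_alleles)   # data frame with one column per genotype, two rows
counts        <- colSums(allele_df == "1") # number of "1" alleles in each column

print(unname(counts))
```

Because the "|" is removed first, this works for the dummy values both with and without the separator.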
ThomasIsCoding

A slightly modified option is

rowSums(read.csv(text = sub("^(\\d)\\|?(\\d).*", "\\1,\\2", gg), 
         header = FALSE) == 1)
#[1] 0 1 1 2

data

gg <- c('0|0:0,0:0:1,0,0','10:0,0:0:1,0,0','0|1:0,0:0:1,0,0','11:0,0:0:1,0,0')
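
Both answers work on a character vector, while the question describes a data frame with one column per SNP. A minimal sketch of applying the 0/1/2 recoding column by column is shown below. It assumes every column holds character values that start with a pipe-separated genotype; the two-column data frame, `recode_map`, and `recode_snp()` are hypothetical names used only for illustration.

```r
# Hypothetical two-SNP data frame standing in for the real `gg`
gg <- data.frame(
  snp1 = c("0|0:0,0:0:1,0,0", "1|0:0,0:0:1,0,0", "1|1:0,0:0:1,0,0"),
  snp2 = c("0|1:0,0:0:1,0,0", "0|0:0,0:0:1,0,0", "1|1:0,0:0:1,0,0"),
  stringsAsFactors = FALSE
)

# Lookup table from the three-character genotype prefix to the recoded value
recode_map <- c("0|0" = 0L, "0|1" = 1L, "1|0" = 1L, "1|1" = 2L)

recode_snp <- function(column) {
  genotype <- substr(as.character(column), 1, 3)  # e.g. "0|1"
  unname(recode_map[genotype])                    # prefixes not in the table become NA
}

# Assigning into recoded[] keeps the data frame shape and column names
recoded <- gg
recoded[] <- lapply(gg, recode_snp)

print(recoded)
```

The lookup table makes the mapping explicit, and any value whose prefix is not one of the four listed genotypes ends up as `NA` instead of being silently recoded.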
akrun