I am working with DNA methylation data from a microarray. Each 'probe' in the array can have multiple genes associated with it, and each gene can have multiple probes. Here is a short example:

|probe      | P.Value| adj.P.Val|      Dbeta|UCSC_REFGENE_NAME          |
|:----------|-------:|---------:|----------:|:--------------------------|
|cg23516680 |   2e-07| 0.0003419| -0.0172609|LYST                       |
|cg02390624 |   2e-07| 0.0003419|  0.0170831|SYTL2;SYTL2;SYTL2          |
|cg08808720 |   2e-07| 0.0003424| -0.0129818|KIF5C;MIR1978              |
|cg12074090 |   2e-07| 0.0003300| -0.0169523|ANGPT2;ANGPT2;ANGPT2;MCPH1 |
|cg10376100 |   1e-07| 0.0002714|  0.0172562|LYST;MIR1537               |

What I'd like to do is group the probes by ANY of the gene names that appear in the UCSC_REFGENE_NAME column (e.g. one group would be all probes associated with the gene LYST, and another all probes associated with MIR1537).

Points:

  • I know this will result in a single probe/row occurring more than once (LYST and MIR1537 should both be groups that include cg10376100).
  • I do not want the same probe to appear more than once for the same gene (e.g. cg12074090 should occur only once for ANGPT2).

Suggestions?

Calen
  • I think you'd be best off making a separate table with two columns: `probe` and `UCSC_REFGENE_NAME`, that would run down the page. See https://stackoverflow.com/questions/13773770/split-comma-separated-column-into-separate-rows for how to split the strings into separate rows. – thelatemail Nov 03 '17 at 22:13
  • That's helpful and yes, simple. Thank you. – Calen Nov 04 '17 at 01:09

1 Answer

Expanding on @thelatemail's comment, you can use tidyr::separate_rows to create one row for each individual entry in the UCSC_REFGENE_NAME column. You can then remove the duplicate rows (e.g. the repeated SYTL2 and ANGPT2 entries) with dplyr::distinct.

library(dplyr)
library(tidyr)

df %>% 
  separate_rows(UCSC_REFGENE_NAME, sep = ";") %>%
  distinct()

#>        probe P.Value adj.P.Val      Dbeta UCSC_REFGENE_NAME
#> 1 cg23516680   2e-07 0.0003419 -0.0172609              LYST
#> 2 cg02390624   2e-07 0.0003419  0.0170831             SYTL2
#> 3 cg08808720   2e-07 0.0003424 -0.0129818             KIF5C
#> 4 cg08808720   2e-07 0.0003424 -0.0129818           MIR1978
#> 5 cg12074090   2e-07 0.0003300 -0.0169523            ANGPT2
#> 6 cg12074090   2e-07 0.0003300 -0.0169523             MCPH1
#> 7 cg10376100   1e-07 0.0002714  0.0172562              LYST
#> 8 cg10376100   1e-07 0.0002714  0.0172562           MIR1537
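
Once the data is in this long format, the per-gene groups the question asks for can be built in the usual ways. The following is only a sketch: the example data frame is rebuilt inline with fewer columns, and the object names `long`, `by_gene`, and `gene_list` and the choice of summary columns are illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Cut-down copy of the example data (assumption: only probe, Dbeta and the
# gene column are needed for this illustration)
df <- data.frame(
  probe = c("cg23516680", "cg02390624", "cg08808720", "cg12074090", "cg10376100"),
  Dbeta = c(-0.0172609, 0.0170831, -0.0129818, -0.0169523, 0.0172562),
  UCSC_REFGENE_NAME = c("LYST", "SYTL2;SYTL2;SYTL2", "KIF5C;MIR1978",
                        "ANGPT2;ANGPT2;ANGPT2;MCPH1", "LYST;MIR1537"),
  stringsAsFactors = FALSE
)

# One row per unique probe/gene pair, as in the answer above
long <- df %>%
  separate_rows(UCSC_REFGENE_NAME, sep = ";") %>%
  distinct()

# Option 1: one summary row per gene (the summary columns are hypothetical)
by_gene <- long %>%
  group_by(UCSC_REFGENE_NAME) %>%
  summarise(n_probes = n(),
            probes = paste(probe, collapse = ","),
            mean_Dbeta = mean(Dbeta))

# Option 2: a named list holding one data frame per gene
gene_list <- split(long, long$UCSC_REFGENE_NAME)
```

With this, `gene_list[["LYST"]]` holds the rows for cg23516680 and cg10376100, while cg12074090 appears only once under ANGPT2. Whether a summary table or a list of data frames is more convenient depends on what you plan to do with each group afterwards.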

Data used

txt = " |probe      | P.Value| adj.P.Val|      Dbeta|UCSC_REFGENE_NAME          |
  |cg23516680 |   2e-07| 0.0003419| -0.0172609|LYST                       |
  |cg02390624 |   2e-07| 0.0003419|  0.0170831|SYTL2;SYTL2;SYTL2          |
  |cg08808720 |   2e-07| 0.0003424| -0.0129818|KIF5C;MIR1978              |
  |cg12074090 |   2e-07| 0.0003300| -0.0169523|ANGPT2;ANGPT2;ANGPT2;MCPH1 |
  |cg10376100 |   1e-07| 0.0002714|  0.0172562|LYST;MIR1537               |"

# Replace the pipe delimiters with spaces so read.table can parse the text
df <- read.table(text = stringr::str_replace_all(txt, "\\|", " "),
                 header = TRUE, stringsAsFactors = FALSE)
markdly
  • Yes that works! Thanks for helping me get my head around this. I'd give you a plus one but I'm a Stackoverflow wimp so I can't yet. – Calen Nov 04 '17 at 01:08