I am working with DNA methylation data from a microarray. Each 'probe' on the array can be associated with one or more genes, and each gene can have multiple probes. Here is a short example:
|probe | P.Value| adj.P.Val| Dbeta|UCSC_REFGENE_NAME |
|:----------|-------:|---------:|----------:|:--------------------------|
|cg23516680 | 2e-07| 0.0003419| -0.0172609|LYST |
|cg02390624 | 2e-07| 0.0003419| 0.0170831|SYTL2;SYTL2;SYTL2 |
|cg08808720 | 2e-07| 0.0003424| -0.0129818|KIF5C;MIR1978 |
|cg12074090 | 2e-07| 0.0003300| -0.0169523|ANGPT2;ANGPT2;ANGPT2;MCPH1 |
|cg10376100 | 1e-07| 0.0002714| 0.0172562|LYST;MIR1537 |
What I'd like to do is group the probes by each gene name that appears in the UCSC_REFGENE_NAME column, where multiple names are separated by semicolons (e.g. one group would be all probes associated with the gene LYST, and another all probes associated with MIR1537).
Points:
- I know this will result in a single probe/row occurring more than once (LYST and MIR1537 should both be groups that include cg10376100).
- I do not want the same probe to appear more than once for the same gene (e.g. cg12074090 should occur only once for ANGPT2).
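
To make the desired result concrete, here is a minimal sketch of the grouping in plain Python. It only illustrates the intended output and is not tied to any particular tool; the helper name `group_probes_by_gene` is hypothetical, and the rows are the probe and UCSC_REFGENE_NAME values copied from the example table.

```python
from collections import defaultdict

# (probe, UCSC_REFGENE_NAME) pairs copied from the example table
rows = [
    ("cg23516680", "LYST"),
    ("cg02390624", "SYTL2;SYTL2;SYTL2"),
    ("cg08808720", "KIF5C;MIR1978"),
    ("cg12074090", "ANGPT2;ANGPT2;ANGPT2;MCPH1"),
    ("cg10376100", "LYST;MIR1537"),
]


def group_probes_by_gene(rows):
    """Hypothetical helper: map each gene name to the probes annotated with it."""
    groups = defaultdict(list)
    for probe, gene_field in rows:
        # Split on ';' and drop repeated names (keeping first-seen order),
        # so a probe is listed at most once per gene.
        for gene in dict.fromkeys(gene_field.split(";")):
            if gene:  # skip empty annotations
                groups[gene].append(probe)
    return dict(groups)


groups = group_probes_by_gene(rows)
for gene, probes in groups.items():
    print(gene, probes)
```

With the example rows, LYST should contain cg23516680 and cg10376100, MIR1537 should contain cg10376100, and ANGPT2 should contain cg12074090 only once.
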
Suggestions?