
I have a data set in which two pairs of columns are reciprocal: for some rows, swapping gene_a with gene_b (and gene_a_freq with gene_b_freq) produces a row that already exists elsewhere in the data. Is there a way to filter such rows, keeping only one of each pair?

ds <- structure(list(gene_a = c("CACNA2D4", "CTNND2", "GCN1L1", "ROBO2", 
"MLL2", "ZNF521", "ITPR3", "STAB1", "DSP", "ZNF676", "LAMC1", 
"NLRP2", "PCDHGA10", "PRDM16", "PTPRB", "PXDN", "CTNND2", "FBN3", 
"KIF20B", "MYOF"), gene_a_freq = c(0.0303030303030303, 0.0303030303030303, 
0.0656565656565657, 0.0454545454545455, 0.0555555555555556, 0.0353535353535354, 
0.0404040404040404, 0.0353535353535354, 0.0303030303030303, 0.0353535353535354, 
0.0303030303030303, 0.0404040404040404, 0.0303030303030303, 0.0303030303030303, 
0.0303030303030303, 0.0303030303030303, 0.0303030303030303, 0.0353535353535354, 
0.0303030303030303, 0.0353535353535354), gene_b = c("CTNND2", 
"CACNA2D4", "ROBO2", "GCN1L1", "ZNF521", "MLL2", "STAB1", "ITPR3", 
"ZNF676", "DSP", "PTPRB", "PRDM16", "PXDN", "NLRP2", "LAMC1", 
"PCDHGA10", "FBN3", "CTNND2", "MYOF", "KIF20B"), gene_b_freq = c(0.0303030303030303, 
0.0303030303030303, 0.0454545454545455, 0.0656565656565657, 0.0353535353535354, 
0.0555555555555556, 0.0353535353535354, 0.0404040404040404, 0.0353535353535354, 
0.0303030303030303, 0.0303030303030303, 0.0303030303030303, 0.0303030303030303, 
0.0404040404040404, 0.0303030303030303, 0.0303030303030303, 0.0353535353535354, 
0.0303030303030303, 0.0353535353535354, 0.0303030303030303)), .Names = c("gene_a", 
"gene_a_freq", "gene_b", "gene_b_freq"), row.names = c(NA, 20L
), class = "data.frame")

For example, in the output below, if you swapped gene_a with gene_b and gene_a_freq with gene_b_freq in row 2, it would be identical to row 1. The matching rows aren't always adjacent. I'd like to keep only one of the two, so in this example drop row 2 and keep row 1.

     gene_a gene_a_freq   gene_b gene_b_freq
1  CACNA2D4  0.03030303   CTNND2  0.03030303
2    CTNND2  0.03030303 CACNA2D4  0.03030303
3    GCN1L1  0.06565657    ROBO2  0.04545455
4     ROBO2  0.04545455   GCN1L1  0.06565657
5      MLL2  0.05555556   ZNF521  0.03535354
6    ZNF521  0.03535354     MLL2  0.05555556
7     ITPR3  0.04040404    STAB1  0.03535354
8     STAB1  0.03535354    ITPR3  0.04040404
9       DSP  0.03030303   ZNF676  0.03535354
10   ZNF676  0.03535354      DSP  0.03030303
11    LAMC1  0.03030303    PTPRB  0.03030303
12    NLRP2  0.04040404   PRDM16  0.03030303
13 PCDHGA10  0.03030303     PXDN  0.03030303
14   PRDM16  0.03030303    NLRP2  0.04040404
15    PTPRB  0.03030303    LAMC1  0.03030303
16     PXDN  0.03030303 PCDHGA10  0.03030303
17   CTNND2  0.03030303     FBN3  0.03535354
18     FBN3  0.03535354   CTNND2  0.03030303
19   KIF20B  0.03030303     MYOF  0.03535354
20     MYOF  0.03535354   KIF20B  0.03030303
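The filtering described above can be sketched in base R by building an order-independent key for each gene pair and dropping repeated keys. This is only a sketch under stated assumptions: `ds_small`, `first`, `second`, `pair_key` and `deduped` are illustrative names, `ds_small` is a hand-copied subset of the rows shown above (rows 1-4, 11 and 15), and the key uses the gene names only, which assumes each gene always carries the same frequency.

```r
# Illustrative subset of the question's data (rows 1-4, 11 and 15).
ds_small <- data.frame(
  gene_a      = c("CACNA2D4", "CTNND2", "GCN1L1", "ROBO2", "LAMC1", "PTPRB"),
  gene_a_freq = c(0.0303030303030303, 0.0303030303030303, 0.0656565656565657,
                  0.0454545454545455, 0.0303030303030303, 0.0303030303030303),
  gene_b      = c("CTNND2", "CACNA2D4", "ROBO2", "GCN1L1", "PTPRB", "LAMC1"),
  gene_b_freq = c(0.0303030303030303, 0.0303030303030303, 0.0454545454545455,
                  0.0656565656565657, 0.0303030303030303, 0.0303030303030303),
  stringsAsFactors = FALSE
)

# pmin()/pmax() pick the alphabetically smaller/larger gene name per row,
# so (A, B) and (B, A) end up with the same key.
first    <- pmin(ds_small$gene_a, ds_small$gene_b)
second   <- pmax(ds_small$gene_a, ds_small$gene_b)
pair_key <- paste(first, second, sep = "|")

# Keep the first row seen for each unordered pair; the kept rows are unchanged.
deduped <- ds_small[!duplicated(pair_key), ]
print(deduped)
```

Because `duplicated()` marks only the later occurrences, the first row of each reciprocal pair is kept as-is, which matches the "drop row 2, keep row 1" behaviour asked for. The same three lines should apply to the full `ds` by replacing `ds_small` with `ds`.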

Thanks

jraab
  • It looks like what you're after is sort of a network. Perhaps check out [this](http://sites.stat.psu.edu/~dhunter/Rnetworks/) intro to the `network` package which may prove a big help. – MichaelChirico Aug 02 '15 at 20:17
  • Seems like you are looking for something like `ds[c("gene_a", "gene_b")] <- t(apply(ds, 1, function(x) sort(x[c("gene_a", "gene_b")]))) ; unique(ds)` – David Arenburg Aug 02 '15 at 20:23
  • Thanks for the response. The linked answer above is very helpful. Missed it in my searches. – jraab Aug 02 '15 at 20:32
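
A variant of the sort-then-`unique` idea from the comments, sketched so that each frequency moves together with its gene name. Sorting only the two gene columns leaves the frequency columns in their original positions, so a pair such as rows 3 and 4 would still differ. The names `ds_small`, `swap`, `canon` and `result` are illustrative, and `ds_small` is a hand-copied subset of the question's data (rows 3, 4, 7, 8 and 17).

```r
# Illustrative subset of the question's data (rows 3, 4, 7, 8 and 17).
ds_small <- data.frame(
  gene_a      = c("GCN1L1", "ROBO2", "ITPR3", "STAB1", "CTNND2"),
  gene_a_freq = c(0.0656565656565657, 0.0454545454545455, 0.0404040404040404,
                  0.0353535353535354, 0.0303030303030303),
  gene_b      = c("ROBO2", "GCN1L1", "STAB1", "ITPR3", "FBN3"),
  gene_b_freq = c(0.0454545454545455, 0.0656565656565657, 0.0353535353535354,
                  0.0404040404040404, 0.0353535353535354),
  stringsAsFactors = FALSE
)

# Rows whose gene_a sorts after gene_b need their two halves swapped.
swap <- ds_small$gene_a > ds_small$gene_b

# Swap gene and frequency together; data-frame assignment is positional,
# so the b-columns land in the a-columns and vice versa.
canon <- ds_small
canon[swap, c("gene_a", "gene_a_freq", "gene_b", "gene_b_freq")] <-
  ds_small[swap, c("gene_b", "gene_b_freq", "gene_a", "gene_a_freq")]

# After canonicalising, reciprocal rows are exact duplicates.
result <- unique(canon)
print(result)
```

Unlike a key built from gene names alone, this compares the frequencies as well, so two rows are merged only when all four values agree after the swap.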
