I have three data frames (Forest, Agriculture, and Urban), each with 1 row and 24145 columns (see example at the bottom). Each column represents a different molecular formula, and the value in each cell is the relative amount of that formula in the sample (Forest, Agriculture, or Urban).

I'm trying to figure out the best way to find which molecular formulae are unique to each of the three samples above. For example, if I have one molecular formula (C10H10) that has a value of 0.12 for Forest but 0 for both Agriculture and Urban, I want to be able to obtain a final product that shows that particular formula was only present in the Forest sample.

Ultimately, I want to then make a plot with this final product where I can plot the molecular formula information on the axes (ratio of oxygen to carbon on the x and ratio of hydrogen to carbon on the y) and have individual points within the plot corresponding to those unique formulae, color coded to represent which sample they were uniquely found in.

Thanks in advance!

Small example of the input, with the three separate data frames combined into one called Samples (input in reality has 24145 different molecular formulae, not just the 4 listed here):

              C10H10O3N1S0   C10H10O4N1S0    C10H10O5N1S0  C10H10O5N1S1
Forest        0.00           1.44            0.00          0.00
Agriculture   0.00           0.00            1.11          4.94
Urban         1.29           0.00            1.33          0.00
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jul 14 '20 at 19:04
  • If I understand correctly what you want, you may want to use the `anti_join` function from dplyr. It keeps only the rows of one data frame that have no match in the other. – Daniel R Jul 14 '20 at 19:16
  • @DanielR thanks for that suggestion. Reading up on anti_join, would that just give me places where their relative abundances are different (i.e. having a value of 0.12 for Agriculture vs. having a value of 0.13 for Urban)? Ideally I would want something that only gives me the formulae for which two of the three data frames (Forest/Ag/Urban) have 0 abundance while the third has an abundance greater than 0. – Derrick Vaughn Jul 15 '20 at 00:30
  • @MrFlick I didn't think an example was needed but I'll edit the post to include a small example of the input. This is my first time doing this sort of thing so I'm not exactly sure what my desired output is. Ideally I want something that tells me which formulae are unique to the forest sample, which formulae are unique to the agriculture sample, and which formulae are unique to the urban sample. – Derrick Vaughn Jul 15 '20 at 00:32
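
A minimal sketch of how the example table could be turned into a reproducible input, as requested in the comments above. The values are copied from the table in the question; the column names follow its header, with the second taken to be `C10H10O4N1S0`. The object name `Samples` comes from the question, and everything else is only an illustration.

```r
# Example values from the question: rows are samples, columns are formulae.
# Assumption: the second column is named C10H10O4N1S0.
Samples <- data.frame(
  C10H10O3N1S0 = c(0.00, 0.00, 1.29),
  C10H10O4N1S0 = c(1.44, 0.00, 0.00),
  C10H10O5N1S0 = c(0.00, 1.11, 1.33),
  C10H10O5N1S1 = c(0.00, 4.94, 0.00),
  row.names = c("Forest", "Agriculture", "Urban")
)

# dput() prints a definition that others can paste and run directly
dput(Samples)
```

Pasting the `dput()` output into a question lets others recreate exactly the same object.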

1 Answer

With example data:

df <- data.frame(Forest=0:20,Agriculture=10:30,Urban=c(10:20,51:60))

   Forest Agriculture Urban
1       0          10    10
2       1          11    11
3       2          12    12
4       3          13    13
5       4          14    14
6       5          15    15
7       6          16    16
8       7          17    17
9       8          18    18
10      9          19    19
11     10          20    20
12     11          21    51
13     12          22    52
14     13          23    53
15     14          24    54
16     15          25    55
17     16          26    56
18     17          27    57
19     18          28    58
20     19          29    59
21     20          30    60

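We can do something like this:

uniquevals <- list()

# For each column, keep the values that appear in none of the other columns
for (i in seq_len(ncol(df))) {
  in_other <- apply(df[, -i, drop = FALSE], 2, function(x) df[, i] %in% x)
  uniquevals[[i]] <- df[, i][rowSums(in_other) == 0]
}

names(uniquevals) <- colnames(df)

to obtain a list of the unique values for each column: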

> uniquevals
$Forest
 [1] 0 1 2 3 4 5 6 7 8 9

$Agriculture
 [1] 21 22 23 24 25 26 27 28 29 30

$Urban
 [1] 51 52 53 54 55 56 57 58 59 60
Daniel O
  • Hi Daniel, thanks for this suggestion! Going through your example, it appears that this gives the values that are different between the three data frames. I ideally would like to find only the variables (formulae in this example) in which two of the three data frames have a zero value for that variable (formula) and the third has a non-zero value. I've added an example of my input in the original question if that helps. Thanks again for the help! – Derrick Vaughn Jul 15 '20 at 00:43
  • @DerrickVaughn, you might want something as simple as `df$Forest[df$Forest == rowSums(df)]` – Daniel O Jul 15 '20 at 11:02
  • That worked, thanks for that! This ended up giving me the values of the cells, though. Is there any way I would be able to get which position (e.g. row 1, row 100, row 111) each one belongs to? That way I could assign it the molecular formula that the row number corresponds to. – Derrick Vaughn Jul 15 '20 at 12:55
  • Yes, `which(df$Forest == rowSums(df))` – Daniel O Jul 15 '20 at 13:10
  • Perfect! Thanks for your help! – Derrick Vaughn Jul 15 '20 at 14:48
  • Using dataframe df and the which function you described above, I found rows 5, 10, 16, and 19 were the only rows in which the molecular formulae were present in the forest sample but not the other two. I ended up calling this Forest_unique: `Forest_unique <- which(df$Forest == rowSums(df))`. Is there any way I can take Forest_unique and apply it to another dataframe? The other dataframe I would like to apply it to is called chem, with the rows corresponding to the same molecular formulae as in the original data frame and columns for HC (hydrogen to carbon ratio) and OC (oxygen to carbon ratio). – Derrick Vaughn Jul 15 '20 at 15:34
  • You can subset data frames with `df[rows,columns]`. In your case, it sounds like you want to use `chem[Forest_unique,]`. Note that leaving the columns position empty selects all columns. You can even skip a couple of steps and directly use `chem[df$Forest == rowSums(df),]`. This only works if, as you've said, the two dataframes have corresponding rows. – Daniel O Jul 15 '20 at 16:39
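
The steps discussed in these comments can be pulled together in one sketch, applied to the example table from the question. It is an illustration under stated assumptions, not the method from the answer: it counts non-zero entries per formula instead of comparing against `rowSums(df)`, which also avoids matching formulae that are zero in every sample. The names `abund`, `present`, `result`, and `count_of` are hypothetical, the second column name is taken to be `C10H10O4N1S0`, and the H/C and O/C ratios are parsed from the formula names here instead of being read from the `chem` data frame.

```r
# Example values from the question: rows are samples, columns are formulae
Samples <- data.frame(
  C10H10O3N1S0 = c(0.00, 0.00, 1.29),
  C10H10O4N1S0 = c(1.44, 0.00, 0.00),
  C10H10O5N1S0 = c(0.00, 1.11, 1.33),
  C10H10O5N1S1 = c(0.00, 4.94, 0.00),
  row.names = c("Forest", "Agriculture", "Urban")
)

# Transpose so that each formula is a row and each sample is a column
abund <- as.data.frame(t(Samples))

# A formula is unique to a sample when exactly one sample is non-zero
present <- abund > 0
only_one <- rowSums(present) == 1

# Name of the single sample in which each unique formula occurs
hit <- apply(present[only_one, , drop = FALSE], 1, which.max)
result <- data.frame(
  formula = rownames(abund)[only_one],
  sample = colnames(present)[hit],
  stringsAsFactors = FALSE
)

# Element counts parsed from the names (assumes the pattern C#H#O#N#S#)
count_of <- function(el, f) {
  as.numeric(sub(paste0(".*", el, "([0-9]+).*"), "\\1", f))
}
result$HC <- count_of("H", result$formula) / count_of("C", result$formula)
result$OC <- count_of("O", result$formula) / count_of("C", result$formula)

print(result)

# O/C on the x axis, H/C on the y axis, one colour per sample
plot(result$OC, result$HC, col = as.integer(factor(result$sample)),
     pch = 19, xlab = "O/C", ylab = "H/C")
```

If the ratios already exist in a data frame such as `chem` with rows in the same order as `abund`, `chem[only_one, ]` could replace the parsing step.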