dplyr update original data frame with new values after subset

Question

I am trying to change the value of entries in a dataframe column based on a series of conditions. I need to change the 'group' values of the top (or bottom) 10 entries of a certain type.

My data is in a dataframe that looks like this:

> head(diff_df_min)
  external_gene_name   gene_biotype  Fold  p.value     group
1      RP11-431K24.1        lincRNA -4.13 4.86e-06 signif_fc
2              UBE4B protein_coding  2.42 3.91e-06 signif_fc
3             UBIAD1 protein_coding  2.74 5.58e-05 signif_fc
4             PTCHD2 protein_coding  3.37 2.68e-06 signif_fc
5             DRAXIN protein_coding  3.04 1.42e-06 signif_fc
6             VPS13D protein_coding  4.26 1.60e-07 signif_fc

> dim(diff_df_min)
[1] 1824    5

I have figured out this solution with dplyr:

diff_df_min %>%
        filter(gene_biotype == "protein_coding") %>% # subset for protein coding genes
        arrange(-Fold, p.value) %>% # Sort by Fold change, then by p value
        slice(1:10) %>% # take the top 10 entries... 
        mutate(group = "top_signif_fc") # ... and change the "group" column value to "top_signif_fc"

This gives the exact results I want:

   external_gene_name   gene_biotype Fold  p.value         group
1               CROCC protein_coding 5.46 3.44e-14 top_signif_fc
2               KCNA2 protein_coding 5.43 2.08e-11 top_signif_fc
3             PITPNC1 protein_coding 5.32 8.16e-11 top_signif_fc
4                RRP8 protein_coding 5.31 1.01e-10 top_signif_fc
5             HEPACAM protein_coding 5.27 1.26e-10 top_signif_fc
6              SGK223 protein_coding 5.14 3.45e-15 top_signif_fc
7               DDX3Y protein_coding 5.03 1.82e-09 top_signif_fc
8            ARHGAP10 protein_coding 4.99 2.83e-09 top_signif_fc
9              RNF180 protein_coding 4.98 3.19e-09 top_signif_fc
10              CSPG5 protein_coding 4.97 9.92e-12 top_signif_fc

Except this is not updating these values in the original dataframe, it is only showing the results after applying the functions. Similarly, I have tried to do the same in data.table and figured out this method:

setDT(diff_df_min,key = "external_gene_name")
diff_df_min[gene_biotype == "protein_coding"][order(-Fold, p.value), head(.SD, 10)][,group := "top_signif_fc"]

But again this only RETURNS the results, it does not update the original dataframe.

    external_gene_name   gene_biotype Fold  p.value         group
 1:              CROCC protein_coding 5.46 3.44e-14 top_signif_fc
 2:              KCNA2 protein_coding 5.43 2.08e-11 top_signif_fc
 3:            PITPNC1 protein_coding 5.32 8.16e-11 top_signif_fc
 4:               RRP8 protein_coding 5.31 1.01e-10 top_signif_fc
 5:            HEPACAM protein_coding 5.27 1.26e-10 top_signif_fc
 6:             SGK223 protein_coding 5.14 3.45e-15 top_signif_fc
 7:              DDX3Y protein_coding 5.03 1.82e-09 top_signif_fc
 8:           ARHGAP10 protein_coding 4.99 2.83e-09 top_signif_fc
 9:             RNF180 protein_coding 4.98 3.19e-09 top_signif_fc
10:              CSPG5 protein_coding 4.97 9.92e-12 top_signif_fc

You can see this when you check the values in the data frame after running any of these commands (or run a subset of the commands again):

> diff_df_min[which(diff_df_min['external_gene_name'] == "CROCC"),]
    external_gene_name   gene_biotype Fold  p.value     group
372              CROCC protein_coding 5.46 3.44e-14 signif_fc

And of course, if you try to use either of the methods like this:

diff_df_min <- ...

You end up overwriting the original dataframe with only the 10 lines that were selected with dplyr or data.table.

I had previously been doing similar things in base R, but could not get this case to work. I tried it, and ended up with this, which is ridiculous and does not work correctly:

diff_df_min[with(diff_df_min[which(diff_df_min['gene_biotype'] == "protein_coding"),], order(-Fold, p.value) ),"group"][1:top_gene_number] <- "top_signif_fc"

^^ Somewhere along the way, the indexes get messed up and so the entries that are eventually changed are not the intended entries.

I have read dozens and dozens of pages about this so far, including many tutorials and even this but so far I have been unable to find anything that actually gives a solution for this. I don't want to simply print out a modified dataframe, I want to update the original dataframe entries with the new entries.

akrun · Accepted Answer · 2016-09-16T02:59:04.170

4

We can use an ifelse statement to make the changes instead of slice to subset it and also replace the filter (that removes the rows) with arrangeing based on the "protein_coding" as well, and also assign the the output back to the original dataset or to a new

diff_df_minNew <- diff_df_min %>%
                     arrange(desc(gene_biotype == "protein_coding"), 
                                desc(Fold), p.value) %>% 
                     mutate(group = ifelse(row_number() < 11, "top_signif_fc", group))

A corresponding option using data.table would be

library(data.table)
diff_df_minNew2 <- setDT(diff_df_min)[order(-(gene_biotype=="protein_coding"),
      -Fold, p.value)][seq_len(10), group := "top_signif_fc"][]

edited Sep 16 '16 at 02:59

answered Sep 16 '16 at 02:32

akrun

874,273
37
540
662

would the filter statement still not make this override the original? – zacdav Sep 16 '16 at 02:34
@zacdave changed the `filter` step – akrun Sep 16 '16 at 02:35

dplyr update original data frame with new values after subset

1 Answers1