0

Have df as seen below

df <- read.table(text="name id_final    id1 id2 id3
sample1 10.96311    4.767571    3.692556    2.966773
sample2 10.83782    11.61998    11.402257   10.301068
sample3 13.98669    12.123346   10.299306   8.85533
sample4 13.97313    12.200774   11.874366   11.013115
sample5 13.89532    10.712515   9.102278    9.832699
sample6 13.86255    11.808834   9.180613    8.813621", header=T, sep='\t')
head(df)
> head(df)
     name id_final       id1       id2       id3
1 sample1 10.96311  4.767571  3.692556  2.966773
2 sample2 10.83782 11.619980 11.402257 10.301068
3 sample3 13.98669 12.123346 10.299306  8.855330
4 sample4 13.97313 12.200774 11.874366 11.013115
5 sample5 13.89532 10.712515  9.102278  9.832699
6 sample6 13.86255 11.808834  9.180613  8.813621

need to do some basic math divide each column with the id_final column and create new columns with_log as suffix this can be done with a simple mutate, as seen below.

df <- df %>%
  mutate(id1_log = log2(id1/id_final),
         id2_log = log2(id2/id_final),
         id3_log = log2(id3/id_final))
head(df)
> head(df)
     name id_final       id1       id2       id3    id1_log     id2_log     id3_log
1 sample1 10.96311  4.767571  3.692556  2.966773 -1.2013308 -1.56996541 -1.88569067
2 sample2 10.83782 11.619980 11.402257 10.301068  0.1005330  0.07324483 -0.07328067
3 sample3 13.98669 12.123346 10.299306  8.855330 -0.2062667 -0.44150746 -0.65943661
4 sample4 13.97313 12.200774 11.874366 11.013115 -0.1956825 -0.23480474 -0.34343264
5 sample5 13.89532 10.712515  9.102278  9.832699 -0.3753018 -0.61029950 -0.49893967
6 sample6 13.86255 11.808834  9.180613  8.813621 -0.2313261 -0.59453027 -0.65338590

In the given example its easy if there are only 3 columns, how would i automate this in case i have more than 3, typing this every time is not elegant.

mutate(id1_log = log2(id1/id_final),
          id2_log = log2(id2/id_final),
          id3_log = log2(id3/id_final))

TO give a bigger picture i am trying to write a function that i could use over several files with multiple id1...n columns

2 Answers2

4

Could do:

library(dplyr)

df %>% mutate_at(vars(matches("id\\d+$")), list(log = ~ log2(. / id_final)))

We mutate (at once with mutate_at) the desired columns - these are all that match the regex id\\d+$ which basically matches column names which end with a number and are preceded by id (e.g. to avoid capturing id_final or any other id_.. columns.

Afterwards we provide a list with the desired transformations. You can provide a name to a transformation, and this name is then automatically appended to the column name. We say log, so columns automatically get _log at the end; you could write anything else there.

If you don't provide a name, the already existing columns will be modified; if you do, you'll get additional ones like in our case.

Output:

     name id_final       id1       id2       id3    id1_log     id2_log     id3_log
1 sample1 10.96311  4.767571  3.692556  2.966773 -1.2013308 -1.56996541 -1.88569067
2 sample2 10.83782 11.619980 11.402257 10.301068  0.1005330  0.07324483 -0.07328067
3 sample3 13.98669 12.123346 10.299306  8.855330 -0.2062667 -0.44150746 -0.65943661
4 sample4 13.97313 12.200774 11.874366 11.013115 -0.1956825 -0.23480474 -0.34343264
5 sample5 13.89532 10.712515  9.102278  9.832699 -0.3753018 -0.61029950 -0.49893967
6 sample6 13.86255 11.808834  9.180613  8.813621 -0.2313261 -0.59453027 -0.65338590
arg0naut91
  • 14,574
  • 2
  • 17
  • 38
1

And here is a data.table option:

library(data.table)
cols <- names(df)[3:5] # first, select columns you are interested in (or names(df)[grepl("id\\d+$", names(df))])
setDT(df)[, paste(cols, "log", sep = "_") :=  lapply(.SD, function(x) log2(x/id_final)),
          .SDcols = cols][] # apply { function(x) log2(x/id_final) } to selected columns
# output
      name id_final       id1       id2       id3    id1_log     id2_log     id3_log
1: sample1 10.96311  4.767571  3.692556  2.966773 -1.2013308 -1.56996541 -1.88569067
2: sample2 10.83782 11.619980 11.402257 10.301068  0.1005330  0.07324483 -0.07328067
3: sample3 13.98669 12.123346 10.299306  8.855330 -0.2062667 -0.44150746 -0.65943661
4: sample4 13.97313 12.200774 11.874366 11.013115 -0.1956825 -0.23480474 -0.34343264
5: sample5 13.89532 10.712515  9.102278  9.832699 -0.3753018 -0.61029950 -0.49893967
6: sample6 13.86255 11.808834  9.180613  8.813621 -0.2313261 -0.59453027 -0.65338590
nghauran
  • 6,648
  • 2
  • 20
  • 29
  • This is good but with thousands of columns, the first part becomes tiresome. Perhaps some "automated" way to extract the target columns? – NelsonGon Feb 28 '19 at 18:25
  • 1
    Sure, you're right. Depending on OP data, selecting targeted columns would be more or less tiresome but [there are many ways to do that](https://stackoverflow.com/questions/18587334/subset-data-to-contain-only-columns-whose-names-match-a-condition). The point here is to see that there is a necessary step consisting in the selection of targeted columns, no matter if column names are consistent or not – nghauran Feb 28 '19 at 18:34