1

I have a dataframe called "wc" in the following format

PARENT_MOL_CHEMBL_ID    TARGET_TYPE
C10                      ABL 
C10                      EGFR
C939                     TP53

I want to reshape the matrix in the following format: DESIRED OUTPUT

PARENT_MOL_CHEMBL_ID  ABL  EGFR  TP53
C10                     1    1     0
C939                    0    0     1

I have read reshape package in R. I tried do call the cast function but it does not work.

cast(wc,PARENT_MOL_CHEMBL_ID~TARGET_TYPE, value = "TARGET_TYPE")

I do not understand what argument should I use in the value section. Kindly guide me how to convert my data.frame in the desired shape.

If there is a better alternate like plyr function, kindly highlight alternatives as well. Thank you

Uwe
  • 41,420
  • 11
  • 90
  • 134
KHAN irfan
  • 253
  • 2
  • 9
  • jogo It is different I want to convert the data in binary format to show which of the targets are present and which are absent. Kindly understand the question. – KHAN irfan Apr 05 '17 at 09:25

2 Answers2

4

The OP asked for help with the arguments to the cast() function of the reshape package. However, the reshape package was superseded by the reshape2 package from the same package author. According to the package description, the reshape2 package is

A Reboot of the Reshape Package

Using reshape2, the desired result can be produced with

reshape2::dcast(wc, PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length, 
                value.var = "TARGET_TYPE")
#  PARENT_MOL_CHEMBL_ID ABL EGFR TP53
#1                  C10   1    1    0
#2                 C939   0    0    1

BTW: The data.table package has implemented (and enhanced) dcast() as well. So, the same result can be produced with

data.table::dcast(wc, PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length, 
                  value.var = "TARGET_TYPE")

Additional columns

The OP mentioned other columns in the data frame which should be shown together with the spread or wide data. Unfortunately, the OP hasn't supplied particular sample data, so we have to consider two use cases.

Case 1: Additional columns go along with the id column

The data could look like

wc
#  PARENT_MOL_CHEMBL_ID TARGET_TYPE extra_col1
#1                  C10         ABL          a
#2                  C10        EGFR          a
#3                 C939        TP53          b

Note that the values in extra_col1 are in line with PARENT_MOL_CHEMBL_ID.

This is an easy case, because the formula in dcast() accepts ... which represents all other variables not used in the formula:

reshape2::dcast(wc, ... ~ TARGET_TYPE, fun.aggregate = length, 
                value.var = "TARGET_TYPE")
#  PARENT_MOL_CHEMBL_ID extra_col1 ABL EGFR TP53
#1                  C10          a   1    1    0
#2                 C939          b   0    0    1

The resulting data.frame does contain all other columns.

Case2: Additional columns don't go along with the id column

Now, another column is added:

wc
#  PARENT_MOL_CHEMBL_ID TARGET_TYPE extra_col1 extra_col2
#1                  C10         ABL          a          1
#2                  C10        EGFR          a          2
#3                 C939        TP53          b          3

Note that extra_col2 has two different values for C10. This will cause the simple approach to fail. So, a two step approach has to be implemented: reshaping first and joining afterwards with the original data frame. The data.table package is used for both steps, now:

library(data.table)
# reshape from long to wide, result has only one row per id column
wide <- dcast(setDT(wc), PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length, 
                value.var = "TARGET_TYPE")
# right join, i.e., all rows of wc are included
wide[wc, on = "PARENT_MOL_CHEMBL_ID"]
#   PARENT_MOL_CHEMBL_ID ABL EGFR TP53 TARGET_TYPE extra_col1 extra_col2
#1:                  C10   1    1    0         ABL          a          1
#2:                  C10   1    1    0        EGFR          a          2
#3:                 C939   0    0    1        TP53          b          3

The result shows the aggregated values in wide format together with any other columns.

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • what does the argument 'fun.aggregate = length' do? – KHAN irfan Apr 20 '17 at 20:17
  • 1
    @KHANirfan Please, see the documentation of the `dcast()` function (`?data.table::dcast`): _`fun.aggregate` Should the data be aggregated before casting? If the formula doesn't identify a single observation for each cell, then aggregation defaults to `length` with a message. **NEW**: it is possible to provide a list of functions to fun.aggregate. See Examples._ In this cases, it simply counts the number of occurences. – Uwe Apr 20 '17 at 20:30
  • I just want a binary representation of the data and I do not want a count. which aggregate function is required? or do I remove the function from the arguments of dcast? I removed the aggregate function but when I check the range of the data It ranges from 1-21. – KHAN irfan Apr 21 '17 at 13:03
  • I am currently using the fun aggregate and then I convert by subsetting and then combining the data sset[sset > 1 ] <- 1 – KHAN irfan Apr 21 '17 at 13:47
1
library(tidyverse)

# create an example dataset
wc <- data.frame(PARENT_MOL_CHEMBL_ID = c("C10", "C10", "C939"), TARGET_TYPE = c("ABL", "EGFR", "TP53"))

wc <- wc %>% 
  # create an auxiliary variable
  mutate(AUX = 1) %>%
  # spread the data from long to wide and fill the empty cells with 0
  # EDIT based on Sotos comment. Thanks! :)
  spread(TARGET_TYPE, AUX, fill = 0)

The result of this is:

  PARENT_MOL_CHEMBL_ID ABL EGFR TP53
    C10                 1    1    0
   C939                 0    0    1

More info: http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/

Felix Grossmann
  • 1,224
  • 1
  • 11
  • 30
  • FYI, `spread` has a fill argument (`spread(TARGET, AUX, fill = 0)`) – Sotos Apr 05 '17 at 09:31
  • Thanks @Sotos! Changed it. Makes the answer even simplier :) – Felix Grossmann Apr 05 '17 at 09:34
  • I have a basic question, how to specify the columns in a dataframe as my dataframe 74,000 rows. In place of this row df <- data.frame(PARENT = c("C10", "C10", "C939"), TARGET = c("ABL", "EGFR", "TP53")). Is it possible to use df <- data.frame(PARENT = wc$PARENT_MOL_CHEMBL_ID, TARGET = wc$TARGET_TYPE) – KHAN irfan Apr 05 '17 at 10:36
  • Let me know if I understand you correctly: your dataframe is named "wc" and it includes two columns, "PARENT_MOL_CHEMBL_ID" and "TARGET_TYPE"? If this is true, you can use your dataset just by replacing the dataframe name df with wc and the column name TARGET with TARGET_TYPE. I created this dataframe because you didnt deliver code for creating it. Hope this helps :) – Felix Grossmann Apr 05 '17 at 10:43
  • I have to add: you just have to replace the manes in the part starting with `df <- df %>% (...)`. My dataset is not needed anymore then :) – Felix Grossmann Apr 05 '17 at 10:50
  • Thank you. But I have one question there is an extra column added in my data.frame that does not have a name, It also appears with reshape package. Can you comment if that extra column has any significance with data reshaping – KHAN irfan Apr 05 '17 at 11:13
  • Can you please pick my answer (check the mark), if I could help you with your question? This would be very kind :) To your further question: do you need this column? in the new dataframe? – Felix Grossmann Apr 05 '17 at 11:18
  • I have liked your answer. However when I repeat this procedure for other column which has non-named cells I get the error. Error: Each variable must be named. Problem variables: 16 kindly let me know how to solve this – KHAN irfan Apr 05 '17 at 11:33
  • Your question has the status answered only if you pick an answer by clicking on the check mark right under the up-/down-vote thingy. This is what I meant :) Why don't you give this column a name when you initialize it? This should solve your problem :) – Felix Grossmann Apr 05 '17 at 11:43
  • @fexjoo I'm kindly suggesting to rephrase your answer to use the variable and column names the OP has used in his Q. Although this doesn't change the quality of your answer it perhaps would have helped to avoid irritations which apparently have taken place. Thank you. – Uwe Apr 05 '17 at 12:01
  • @KHANirfan What do you mean by _non-named cells_? In contrast to Excel, R is not aware of _cells_ but deals with vectors or rows and columns of data frames or matrices. – Uwe Apr 05 '17 at 12:03
  • Kindly like my question as well so that I can cross the 15 point barrier. thanks – KHAN irfan Apr 05 '17 at 12:36
  • @Uwe I had to perform the spread on another variable in a larger dataframe. But I was unable to do it unless I delete the spreaded column. And If the second variable has any empty cells that the spread does not work. so my dataframe had 16 empty values in the second variable. that was the reason of the error. But now we know what If we have to repeat the same protocol for other columns of dataframe then we have to delete the already spreaded column because R disposes the column name but not the column – KHAN irfan Apr 05 '17 at 12:40
  • @UweBlock This is a good idea. I edited it. Thank you. KHAN irfan did I understand you correctly that you solved the problem by yourself now? – Felix Grossmann Apr 05 '17 at 12:54
  • @fexjoo yes. Thanks every1 – KHAN irfan Apr 05 '17 at 13:04
  • @UweBlock Done, thanks. – Felix Grossmann Apr 05 '17 at 13:15
  • @KHANirfan To be honest, I feel more comfortable with `melt()`, `dcast()`, and `data.table` than `tidyverse`. So, I might not be fully aware of possible quirks of `spread()`. I'm just curious, have you tried the `data.table::dcast()` on the larger data.frame after coercing it to `data.table` with `setDT(wc)`? – Uwe Apr 05 '17 at 13:22
  • @Uwe Yes It gives me a correctly made table but the problem is it gives a separate data.frame that only contains the the spreaded list in table form, but this approach gives the whole dataframe along with the other variables. It saves one extra step to unite that cast function would require. I do not also have to worrie to match with the correct rows in that case. You can add 2 more columns to this dataframe and observe by spreading two columns back to back in the same dataframe and observe the behaviour of both the functions. – KHAN irfan Apr 05 '17 at 14:22