Spread over multiple columns in R

Question

I have data for 6 genes at 3 timepoints in long format, and I'm trying to spread it so that each of the six genes gets its own column. It always fails with this error: 'Do you need to create unique ID with tibble::rowid_to_column()? Call rlang::last_error() to see a backtrace'

fgcrkmtptlog



        timepoint  gene  treatment     value        tpt6
   1    24  crk10   treated 1.7883197   24 treated
   2    24  crk10   treated 1.0605152   24 treated
   3    24  crk10   treated 1.0050634   24 treated
   4    24  crk10   treated 1.8876708   24 treated
   5    24  crk10   treated 1.4960427   24 treated
   6    48  crk10   treated 2.4190837   48 treated
   7    48  crk10   treated 2.9805329   48 treated
   8    48  crk10   treated 3.4241471   48 treated
   9    48  crk10   treated 2.3705634   48 treated
   10   48  crk10   treated 2.0378527   48 treated
   11   72  crk10   treated 2.5438502   72 treated
   12   72  crk10   treated 3.7291318   72 treated
   13   72  crk10   treated 2.8419034   72 treated
   14   72  crk10   treated 3.3363484   72 treated
   15   72  crk10   treated 3.2231344   72 treated
   16   24  crk18   treated 2.0620297   24 treated
   17   24  crk18   treated 1.5837581   24 treated
   18   24  crk18   treated 2.1590703   24 treated
   19   24  crk18   treated 2.1706227   24 treated
   20   24  crk18   treated 2.4964019   24 treated
   21   48  crk18   treated 2.6026845   48 treated
   22   48  crk18   treated 2.7898342   48 treated
   23   48  crk18   treated 2.6719992   48 treated
   24   48  crk18   treated 2.7574874   48 treated
   25   48  crk18   treated 3.4852919   48 treated
   26   72  crk18   treated 3.1710652   72 treated
   27   72  crk18   treated 3.3720779   72 treated
   28   72  crk18   treated 1.8194282   72 treated
   29   72  crk18   treated 2.8221811   72 treated
   30   72  crk18   treated 2.8395098   72 treated
   31   24  crk23   treated 0.9164792   24 treated
   32   24  crk23   treated 0.9580680   24 treated
   33   24  crk23   treated 0.5976315   24 treated
   34   24  crk23   treated 1.0597296   24 treated
   35   24  crk23   treated 1.0389352   24 treated
   36   48  crk23   treated 2.1156238   48 treated
   37   48  crk23   treated 2.8226339   48 treated
   38   48  crk23   treated 3.4533979   48 treated
   39   48  crk23   treated 2.7486982   48 treated
   40   48  crk23   treated 2.0324462   48 treated
   41   72  crk23   treated 3.1622761   72 treated
   42   72  crk23   treated 1.7135985   72 treated
   43   72  crk23   treated 2.7186619   72 treated
   44   72  crk23   treated 2.7810451   72 treated
   45   72  crk23   treated 1.4502025   72 treated
   46   24  crk24   treated 0.5338245   24 treated
   47   24  crk24   treated 0.4759149   24 treated
   48   24  crk24   treated 1.1967879   24 treated
   49   24  crk24   treated 1.0627795   24 treated
   50   24  crk24   treated 1.1429535   24 treated
   51   48  crk24   treated 1.4532524   48 treated
   52   48  crk24   treated 2.2573031   48 treated
   53   48  crk24   treated 2.3474122   48 treated
   54   48  crk24   treated 2.2203353   48 treated
   55   48  crk24   treated 2.4594710   48 treated
   56   72  crk24   treated 2.3058234   72 treated
   57   72  crk24   treated 2.4236584   72 treated
   58   72  crk24   treated 2.5484249   72 treated
   59   72  crk24   treated 2.6685704   72 treated
   60   72  crk24   treated 2.0967240   72 treated
   61   24  crk40   treated 1.0119949   24 treated
   62   24  crk40   treated 1.0813096   24 treated
   63   24  crk40   treated 1.7328680   24 treated
   64   24  crk40   treated 1.9962639   24 treated
   65   24  crk40   treated 2.3567004   24 treated
   66   48  crk40   treated 3.5558450   48 treated
   67   48  crk40   treated 2.6131649   48 treated
   68   48  crk40   treated 2.5299872   48 treated
   69   48  crk40   treated 3.4911513   48 treated
   70   48  crk40   treated 3.3247960   48 treated
   71   72  crk40   treated 4.8381673   72 treated
   72   72  crk40   treated 4.9352079   72 treated
   73   72  crk40   treated 4.4292105   72 treated
   74   72  crk40   treated 3.8631403   72 treated
   75   72  crk40   treated 4.0052355   72 treated
   76   24  crk47   treated 0.1378544   24 treated
   77   24  crk47   treated 1.9212654   24 treated
   78   24  crk47   treated 2.3856740   24 treated
   79   24  crk47   treated 1.6301435   24 treated
   80   24  crk47   treated 1.6994583   24 treated
   81   48  crk47   treated 2.8292882   48 treated
   82   48  crk47   treated 2.9817805   48 treated
   83   48  crk47   treated 2.9055344   48 treated
   84   48  crk47   treated 2.9817805   48 treated
   85   48  crk47   treated 3.0199036   48 treated
   86   72  crk47   treated 2.7876993   72 treated
   87   72  crk47   treated 2.9055344   72 treated
   88   72  crk47   treated 3.6472018   72 treated
   89   72  crk47   treated 2.5866866   72 treated
   90   72  crk47   treated 2.6698643   72 treated

I'm trying to get it into a format with timepoint and the genes as columns, covering all six genes and three timepoints:


   fgcrkmtptlog %>% 
     group_by(timepoint) %>% 
     spread(gene, value)

enter image description here

I want the data to look like this picture.

After using

fgcrkmtptlog %>% 
  rowid_to_column() %>%
  spread(gene, value)

the resulting data frame shows lots of NA values:

1   1   24  treated 24 treated  1.788320    NA  NA  NA  NA  NA
2   2   24  treated 24 treated  1.060515    NA  NA  NA  NA  NA
3   3   24  treated 24 treated  1.005063    NA  NA  NA  NA  NA
4   4   24  treated 24 treated  1.887671    NA  NA  NA  NA  NA
5   5   24  treated 24 treated  1.496043    NA  NA  NA  NA  NA
6   6   48  treated 48 treated  2.419084    NA  NA  NA  NA  NA

could you provide a sample dataset? I'm afraid the example you created is not very readable. — maop, Aug 02 '19 at 20:19
Possible duplicate of [How can I spread repeated measures of multiple variables into wide format?](https://stackoverflow.com/questions/29775461/how-can-i-spread-repeated-measures-of-multiple-variables-into-wide-format) — Shree, Aug 02 '19 at 20:32

Fnguyen · Answer 1 · 2019-08-02T20:58:22.020

1

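spread needs every output row to be uniquely identified by the columns that are not being spread; otherwise it cannot decide which value belongs in which cell. If those identifier columns contain duplicates (here every combination of timepoint and gene appears five times), you need to create a new unique row id.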

The error message you posted said exactly this, so add the following to your code:

library(dplyr)   # %>%
library(tibble)  # rowid_to_column()
library(tidyr)   # spread()

fgcrkmtptlog %>% 
  # group_by(timepoint) %>%  # removed: group_by should be unnecessary here
  rowid_to_column() %>%
  spread(gene, value)

This will solve your current error.

Edit:

Depending on your data, spread may introduce NAs. Here is an example:

# Produce sample data
df <- structure(list(Year = c("2014", "2014", "2014", "2014", "2015", 
"2015", "2015", "2015", "2016"), Month = c("01", "06", "07", 
"12", "01", "06", "07", "12", "01"), Day = c("01", "01", "01", 
"01", "01", "01", "01", "01", "01"), test = structure(c(1L, 1L, 
1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), 
    Halfyear = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L
    ), .Label = c("2014 First Half", "2015 First Half", "2016 First Half"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-9L))

# Your code, applied to the sample data
df %>%
  rowid_to_column() %>%
  spread(Month, test)

If you test this, you will see that spread correctly introduces NAs, because not every row has a test value for every Month. Since spread creates one column per month that exists in my data, it also has to show NA wherever a row has no value for that month.

Before spread you had a sparse data set that only showed data that actually exists, but spread completes the data set in order to make it wide.
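If the goal is one row per replicate with a value in every gene column, one possible approach is to number the replicates within each timepoint/gene group instead of giving every row its own id. The following is only a sketch on made-up toy data: the column name replicate and all values are assumptions for illustration, and it assumes each timepoint/gene combination has the same number of rows.

```r
library(dplyr)
library(tidyr)

# Toy long-format data (hypothetical values): 2 genes x 2 timepoints x 3 replicates
long <- data.frame(
  timepoint = rep(c(24, 48), each = 3, times = 2),
  gene      = rep(c("crk10", "crk18"), each = 6),
  value     = c(1.8, 1.1, 1.0, 2.4, 3.0, 3.4,
                2.1, 1.6, 2.2, 2.6, 2.8, 2.7)
)

wide <- long %>%
  group_by(timepoint, gene) %>%
  mutate(replicate = row_number()) %>%  # 1, 2, 3 within each timepoint/gene group
  ungroup() %>%
  spread(gene, value)                   # one column per gene

wide
```

Because the replicate number is shared across genes, rows from different genes line up on the same timepoint/replicate pair, so no NAs are needed to fill the gene columns. Note that this pairing is by row order only; it does not imply that replicate 1 of one gene was measured on the same sample as replicate 1 of another.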

edited Aug 02 '19 at 20:58

answered Aug 02 '19 at 20:35


yes, thanks, but after running this code, some genes' values show NA in some places – the king plant Aug 02 '19 at 20:37
@thekingplant This might be normal: when you spread, every gene in your sample gets a column, but there might not be a value for every combination of timepoint, treatment and gene, so these cells are naturally empty. I will edit my answer to show an example. – Fnguyen Aug 02 '19 at 20:51
yes, but I tried several solutions to delete the NAs, such as ```b[complete.cases(b), ]```, ```na.omit(b)```, and ```b %>% drop_na()```; none of them worked @Zeiram – the king plant Aug 02 '19 at 20:55
@thekingplant You cannot delete the NAs because they are a valid part of your data. Since you transformed the data to wide, each gene column has to show a value in every row, and if there is none it will be NA. Showing only complete cases would remove almost all rows; by using a wide data set you introduced NAs, and this is unavoidable. The only question is why that bothers you, or why you need a wide data set to begin with. – Fnguyen Aug 02 '19 at 21:00
thank you very much. Even though the data frame contains NAs, I can still use anova, so I will just leave the NAs in the data frame. @Zeriam – the king plant Aug 02 '19 at 21:14
@thekingplant happy to help, feel free to mark this as an answer if it helped. Also check how your anova treats NAs, and make sure it isn't simply removing all incomplete rows if that is not what you want. – Fnguyen Aug 02 '19 at 21:16
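To illustrate the point made in the comments above, here is a small sketch on made-up toy data (all values are assumptions) showing why dropping incomplete rows leaves nothing after a spread that uses a per-row id: every row holds a value for exactly one gene and NA for the others.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Toy long-format data (hypothetical values): 2 genes, 2 replicates each
long <- data.frame(
  timepoint = c(24, 24, 24, 24),
  gene      = c("crk10", "crk10", "crk18", "crk18"),
  value     = c(1.8, 1.1, 2.1, 1.6)
)

wide <- long %>%
  rowid_to_column() %>%   # unique id per original row
  spread(gene, value)     # each row keeps only its own gene's value

wide
nrow(na.omit(wide))       # number of rows with no NA at all
```

With two or more genes, no row can be complete in this layout, so na.omit, complete.cases and drop_na all return an empty data frame.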

geekzeus · Answer 2 · 2019-08-02T21:01:03.017

0

# data
df <- structure(list(timepoint = c(24, 24, 48, 72, 24), gene = structure(c(1L, 
2L, 3L, 2L, 1L), .Label = c("crk10", "crk20", "crk30"), class = "factor"), 
value = c(1.3, 1.5, 0.6, 1.7, 1.1)), .Names = c("timepoint", 
"gene", "value"), row.names = c(NA, -5L), class = "data.frame")

# one-liner
library(reshape2)
# one row per `timepoint`, one column per `gene`; `value` is summed within each cell
dcast(df, timepoint ~ gene, value.var = "value", fun.aggregate = sum)

edited Aug 02 '19 at 21:01

answered Aug 02 '19 at 20:57


thanks, I tried this before, but the result shows only one value per timepoint; I still want five values for each timepoint and gene @geekzeus – the king plant Aug 02 '19 at 21:00
Provide the sample data by using `dput(df)` like I did; it's time-consuming to write the data out by hand. – geekzeus Aug 02 '19 at 21:03
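One possible way to keep every replicate with dcast, rather than aggregating them into a single value, is to add a replicate index and include it on the left-hand side of the formula. This is only a sketch on made-up toy data: the column name replicate and all values are assumptions, and it assumes the same number of rows for every timepoint/gene combination.

```r
library(reshape2)

# Toy long-format data (hypothetical values): 2 genes x 2 timepoints x 2 replicates
df <- data.frame(
  timepoint = rep(c(24, 48), each = 2, times = 2),
  gene      = rep(c("crk10", "crk18"), each = 4),
  value     = c(1.3, 1.1, 2.4, 3.0, 2.1, 1.6, 2.6, 2.8)
)

# Number the replicates within each timepoint/gene combination
df$replicate <- ave(df$value, df$timepoint, df$gene, FUN = seq_along)

# One row per timepoint/replicate, one column per gene, no aggregation
wide <- dcast(df, timepoint + replicate ~ gene, value.var = "value")
wide
```

Since each timepoint/replicate/gene cell now contains exactly one value, no aggregation function is needed and nothing is collapsed.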
