Reshape data in R?

Question

I have mean daily data for different sites organized as shown in figure 1 in this folder.

However, I want to organize this data to look like figure 2 in the same folder.

Using this code, the data was reshaped but the final values (reshpae_stage_R.csv) didn't match the original values.

By running the code for the second time, I got this error:

Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L],  : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘NA.January’

Could you please help me why the final values don't match the original values?

Thanks in advance

As stated in the error, you may have duplicated ids. So, one possibility may be to create an index column by group using `ave`. I think there are a lot of duplicates for this problem. — akrun, Jan 18 '15 at 04:24
Don't post data in pictures. Please create a minimal [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and include it in the question. — MrFlick, Jan 18 '15 at 05:18
@MrFlick The whole data set are here (https://www.dropbox.com/s/jotz9zhtvy0qn7o/stage.csv?dl=0) — , Jan 18 '15 at 05:49
Please include all relevant code and data in the question. Links can and will go dead without warning, which will make this question obsolete for future readers. — Roman Luštrik, Jan 19 '15 at 13:42

tchakravarty · Accepted Answer · 2015-01-19T06:25:41.950

Update:

Thanks to @aelwan for catching a bug, and the updated code is below:

library(ggplot2)
library(reshape2)

# read in the data
dfStage = read.csv("reshapeR/Data/stage.csv", header = FALSE, stringsAsFactor = FALSE)

# remove the rows which are min, max, mean & redundant columns
condMMM = stringr::str_trim(dfStage[, 1]) %in% c("Min", "Max", "Mean", "Day")
dfStage = dfStage[!condMMM, 1:13]
dateVars = c("Day", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
colnames(dfStage) = dateVars

# get indices & names of year site combinations
condlSiteYear = grepl("^Daily means", stringr::str_trim(dfStage[, 1]))
condiSiteYear = grep("^Daily means", stringr::str_trim(dfStage[, 1]))
dfSiteYear = dfStage[condlSiteYear,  1, drop = FALSE]

# remove site-year rows from data
dfStage = dfStage[!condlSiteYear, ]

# get the list of sites and years
dfSiteYear$Year = regmatches(dfSiteYear[, 1], regexpr("(?<=Year\\s)([0-9]+)", dfSiteYear[, 1], perl = TRUE))
dfSiteYear$Site = regmatches(dfSiteYear[, 1], 
           regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)", dfSiteYear[, 1], perl = TRUE))

# add the site and years
dfSiteYearLong = dfSiteYear[rep(1:dim(dfSiteYear)[1], each = 31), c("Site", "Year")]
dfStageFinal = cbind(dfStage, dfSiteYearLong)

# reshape
dfStageFinalLong = reshape2::melt(dfStageFinal, id.vars = c("Day", "Site", "Year"), 
                                  measure.vars = dateVars[-1],
                        variable.name = "Month")
dfStageFinalWide = reshape2::dcast(dfStageFinalLong, Day + Month + Year ~ Site, 
                                   value.var = "value")

# cleanup
dfStageFinalWide[, -c(1:3)] = lapply(dfStageFinalWide[, -c(1:3)], as.numeric)

# create a date variable
dfStageFinalWide$Date = with(dfStageFinalWide, 
                             as.Date(paste(Day, Month, Year, sep = "-"), 
                                     format = "%d-%b-%Y"))
# remove the infeasible dates
dfStageFinalWide = dfStageFinalWide[!is.na(dfStageFinalWide$Date), ]
dfStageFinalWide = dfStageFinalWide[order(dfStageFinalWide$Date), ]

# plot the values over time
dfStageFinalLong = 
  reshape2::melt(dfStageFinalWide, id.vars = "Date", measure.vars = unique(dfSiteYear$Site),
       variable.name = "Site")
ggplot(dfStageFinalLong, aes(x = Date, y = value, color = Site))+
  geom_line() + theme_bw() + facet_wrap(~ Site, scale = "free_y")

This leads to the picture below: enter image description here

Original answer:

This example requires a fair amount of data munging skills. You basically have to note the repeating patters in the data -- the data are site-year measurements organized as day x month tables.

Recipe:

Here is a recipe for creating the desired dataset:
1. Remove the rows & columns in the data that are redundant.
2. Extract the rows that identify the year and the site of the table using pattern matching (grep).
3. From the longer string, extract the year and site name using regular expressions (regexpr and regmatches).
4. Find the starting row indices of the tables for each site-year combination and assign the site-year names just extracted to all rows that correspond to that site & year.
5. Now you can go ahead and reshape it into any shape you want. In the code below, the row identifiers are year, month and day, and the columns are the sites.
6. Some cleanup, and you are good to go.

Code:

Here is code for the recipe above:

# read in the data
dfStage = read.csv("reshapeR/Data/stage.csv", header = FALSE, stringsAsFactor = FALSE)

# remove the rows which are min, max, mean & redundant columns
condMMM = stringr::str_trim(dfStage[, 1]) %in% c("Min", "Max", "Mean", "Day")
dfStage = dfStage[!condMMM, 1:13]
dateVars = c("Day", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
colnames(dfStage) = dateVars

# get indices & names of year site combinations
condlSiteYear = grepl("^Daily means", stringr::str_trim(dfStage[, 1]))
condiSiteYear = grep("^Daily means", stringr::str_trim(dfStage[, 1]))
dfSiteYear = dfStage[condlSiteYear,  1, drop = FALSE]

# remove site-year rows from data
dfStage = dfStage[!condlSiteYear, ]

# get the list of sites and years
dfSiteYear$Year = regmatches(dfSiteYear[, 1], regexpr("(?<=Year\\s)([0-9]+)", dfSiteYear[, 1], perl = TRUE))
dfSiteYear$Site = regmatches(dfSiteYear[, 1], 
           regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)", dfSiteYear[, 1], perl = TRUE))

# add the site and years
dfSiteYearLong = dfSiteYear[rep.int(1:dim(dfSiteYear)[1], 31), c("Site", "Year")]
dfStageFinal = cbind(dfStage, dfSiteYearLong)

# reshape
dfStageFinalLong = reshape2::melt(dfStageFinal, id.vars = c("Day", "Site", "Year"), measure.vars = dateVars[-1],
                        variable.name = "Month")
dfStageFinalWide = dcast(dfStageFinalLong, Day + Month + Year ~ Site, value.var = "value")

# cleanup
dfStageFinalWide[, -c(1:3)] = lapply(dfStageFinalWide[, -c(1:3)], as.numeric)

# create a date variable
dfStageFinalWide$Date = with(dfStageFinalWide, 
                             as.Date(paste(Day, Month, Year, sep = "-"), 
                                     format = "%d-%b-%Y"))
# remove the infeasible dates
dfStageFinalWide = dfStageFinalWide[!is.na(dfStageFinalWide$Date), ]
dfStageFinalWide = dfStageFinalWide[order(dfStageFinalWide$Date), ]

# plot the values over time
dfStageFinalLong = 
  melt(dfStageFinalWide, id.vars = "Date", measure.vars = unique(dfSiteYear$Site),
       variable.name = "Site")
ggplot(dfStageFinalLong, aes(x = Date, y = value, color = Site))+
  geom_line() + theme_bw() + facet_wrap(~ Site, scale = "free_y")

Output:

Here is what the output looks like:

> head(dfStageFinalWide)
  Day Month Year Kumeti at Te Rehunga Makakahi at Hamua Makuri at Tuscan Hills Manawatu at Hopelands Manawatu at Upper Gorge Manawatu at Weber Road Mangahao at Ballance
1   1   Jan 1990                  454                NA                    700                  5133                      NA                     NA                   NA
2   1   Jan 1991                 1002              3643                   1416                    50                    3597                   1836                18160
3   1   Jan 1992                 3490             34239                   8922                  3049                    1221                    417                   NA
4   1   Jan 1993                  404                NA                    396                  3408                      NA                    272                   NA
5   1   Jan 1994                   NA                NA                   3189                   795                      NA                   2321                 1889
6   1   Jan 1995                16548              1923                  69862                  4808                      NA                   6169                   94
  Mangapapa at Troup Rd Mangatainoka at Larsons Road Mangatainoka at Pahiatua Town Bridge Mangatainoka at Tararua Park Mangatoro at Mangahei Road Oruakeretaki at S.H.2 Napier
1                  9406                         2767                                   NA                           NA                       6838                         2831
2                  4985                         2479                                  823                         1078                         76                          105
3                   478                         3665                                 1415                          210                        394                         8247
4                  6394                         1298                                   NA                         2668                       3837                         1878
5                 14051                         3561                                   NA                         2645                        807                           NA
6                    NA                         1057                                 7029                         4497                         NA                           NA
  Raparapawai at Jackson Rd Tamaki at Stephensons Tiraumea at Ngaturi
1                      5189                 50444               17951
2                       345                   416                3025
3                      1364                  5713                1710
4                      3457                 28078                8670
5                       199                    NA                 292
6                        NA                    NA               22774

And a picture to bring it all together. enter image description here

I can't thank you enough for your time and help. However, I'm getting the following error at #reshape Error in eval(expr, envir, enclos) : object 'Month' not found. Could you please help me to overcome this problem? — , Jan 18 '15 at 06:58
Did you run this line `dfStageFinalLong = reshape2::melt(dfStageFinal, id.vars = c("Day", "Site", "Year"), measure.vars = dateVars[-1], variable.name = "Month")` unchanged? Note the `variable.name` argument. — tchakravarty, Jan 18 '15 at 07:03
Yes, I did. Please, check a print screen of the error (https://www.dropbox.com/s/eqs64qmehb3yxe9/error.jpg?dl=0) — , Jan 18 '15 at 07:10
Ok, can you show a `head` of `dfStageFinalLong`? Also, try to replace `Month` in the call to `dcast` to `variable`. — tchakravarty, Jan 18 '15 at 07:11
This is the head of the dfStageFinalLong (https://www.dropbox.com/s/hjxwmwtj39hhsd7/head.jpg?dl=0). — , Jan 18 '15 at 07:15
@aelwan Yes, as I thought, the `variable.name` argument is not getting used in the call to `melt`. Switch `Month` to `variable` in the call to `dcast`. My guess is that you are using an old version of `reshape2`. — tchakravarty, Jan 18 '15 at 07:16
reshape2 was 1.4 but I updated it to 1.4.1 and the code is working perfect. Thank you! However, as you have shown above the final result looks like that (1st/Jan, 1st/Feb, 1st/March and so on) (https://www.dropbox.com/s/cef5cmrr0grhapy/Final.jpg?dl=0). I was hoping that I could get the date organized continuously to look like this (https://www.dropbox.com/s/fs9vx5dbgvtltxz/4.jpg?dl=0). Any idea how could I organize it to increase continuously from 1/Jan/1990 to 31/Dec/2014? — , Jan 18 '15 at 07:58
Many thanks. But, I'm afraid that the final values DON'T MATCH the original values, even in your earlier answers. Any suggestion to fix that would be much appreciated. — , Jan 18 '15 at 09:54
@aelwan `3.jpg` in your dropbox has headers clipped, so I have no idea what it is. But you can easily check that the the numbers in `dfStageFinalWide` correspond exactly to `stage.csv`. For example, in `stage.csv`, the value for `Kumeti at Te Rehunga` on `01/01/1990` is 454, and it is the same in `dfStageFinalWide`. **Now it is up to you to tell us what the value 4131 in `3.jpg` is.** — tchakravarty, Jan 18 '15 at 10:20
Yes, you're right. Only the value for Kumeti at Te Rehunga on 01/01/1990 matches the original value in stage.csv. However, the rest of the values using your code (http://goo.gl/ZicvLw) don't match the original values (http://goo.gl/wl5fCk). This was the problem I had using the code I referred to in my question. The results I have got in 3.jpg (using the code in the question) were completely inaccurate as they don't match the original values. I have updated 3.jpg (http://goo.gl/OHZvMR) so you could check the headers. — , Jan 19 '15 at 00:05
@aelwan Agreed. Let me try to figure out what is going wrong. — tchakravarty, Jan 19 '15 at 05:16