0

I have multiple data.frame where each has same weather stations' coordinate but contains different year's temperature observation. However, I intend to construct new data.frame where stations' coordinate will stay but respective annual temperature column will be added programmatically from original multiple data.frame. Perhaps using dplyr package could help, but I have some issue to concatenate Year and Annual_Temp column and construct new column programmatically. Because I have 35 data.frames where each has same ID, long, lat, but Annual_Temp are different from one to another. I need to construct clean tabular data by merging data.frame. How can I make this happen in R? Any way to get this done by using dplyr? Any idea?

For example, here is head of first three data.frame:

> multiple_DF

$air_temp.1980
      Year         ID long   lat Annual_Temp
34090 1980 6.25_51.75 6.25 51.75   10.709091
34091 1980 6.25_51.25 6.25 51.25   10.581818
34092 1980 6.25_50.75 6.25 50.75    9.500000
34224 1980 6.75_51.75 6.75 51.75   10.354545
34225 1980 6.75_51.25 6.75 51.25   10.636364
34226 1980 6.75_50.75 6.75 50.75    9.872727

$air_temp.1981
       Year         ID long   lat Annual_Temp
119884 1981 6.25_51.75 6.25 51.75   10.727273
119885 1981 6.25_51.25 6.25 51.25   10.563636
119886 1981 6.25_50.75 6.25 50.75    9.654545
120018 1981 6.75_51.75 6.75 51.75   10.409091
120019 1981 6.75_51.25 6.75 51.25   10.654545
120020 1981 6.75_50.75 6.75 50.75    9.954545

$air_temp.1982
       Year         ID long   lat Annual_Temp
205678 1982 6.25_51.75 6.25 51.75    11.80909
205679 1982 6.25_51.25 6.25 51.25    11.58182
205680 1982 6.25_50.75 6.25 50.75    10.61818
205812 1982 6.75_51.75 6.75 51.75    11.44545
205813 1982 6.75_51.25 6.75 51.25    11.73636
205814 1982 6.75_50.75 6.75 50.75    10.85455

Desired output (UPDATE):

I want to produce new data.frame where Annual_Temp will be added as the new column where Annual_Temp and Year must be concatenated. Here is the desired data.frame that I want to have:

      ID long   lat Ann_temp_1980 Ann_temp_1981 Ann_temp_1982
1 6.25_51.75 6.25 51.75     10.709091     10.727273        11.80909
2 6.25_51.25 6.25 51.25     10.581818     10.563636        11.58182
3 6.25_50.75 6.25 50.75      9.500000      9.654545        10.61818
4 6.75_51.75 6.75 51.75     10.354545     10.409091        11.44545
5 6.75_51.25 6.75 51.25     10.636364     10.654545        11.73636
6 6.75_50.75 6.75 50.75      9.872727      9.954545        10.85455

How can I make this happen programmatically in R? Any idea?

To repro example data:

multiple_DF = structure(list(air_temp.1980 = structure(list(Year = c(1980L, 
1980L, 1980L, 1980L, 1980L, 1980L), ID = c("6.25_51.75", "6.25_51.25", 
"6.25_50.75", "6.75_51.75", "6.75_51.25", "6.75_50.75"), long = c(6.25, 
6.25, 6.25, 6.75, 6.75, 6.75), lat = c(51.75, 51.25, 50.75, 51.75, 
51.25, 50.75), Annual_Temp = c(10.709091, 10.581818, 9.5, 10.354545, 
10.636364, 9.872727)), .Names = c("Year", "ID", "long", "lat", 
"Annual_Temp"), row.names = c(NA, -6L), class = "data.frame"), 
    air_temp.1981 = structure(list(Year = c(1981L, 1981L, 1981L, 
    1981L, 1981L, 1981L), ID = c("6.25_51.75", "6.25_51.25", 
    "6.25_50.75", "6.75_51.75", "6.75_51.25", "6.75_50.75"), 
        long = c(6.25, 6.25, 6.25, 6.75, 6.75, 6.75), lat = c(51.75, 
        51.25, 50.75, 51.75, 51.25, 50.75), Annual_Temp = c(10.727273, 
        10.563636, 9.654545, 10.409091, 10.654545, 9.954545)), .Names = c("Year", 
    "ID", "long", "lat", "Annual_Temp"), row.names = c(NA, -6L
    ), class = "data.frame"), air_temp.1982 = structure(list(
        Year = c(1982L, 1982L, 1982L, 1982L, 1982L, 1982L), ID = c("6.25_51.75", 
        "6.25_51.25", "6.25_50.75", "6.75_51.75", "6.75_51.25", 
        "6.75_50.75"), long = c(6.25, 6.25, 6.25, 6.75, 6.75, 
        6.75), lat = c(51.75, 51.25, 50.75, 51.75, 51.25, 50.75
        ), Annual_Temp = c(11.80909, 11.58182, 10.61818, 11.44545, 
        11.73636, 10.85455)), .Names = c("Year", "ID", "long", 
    "lat", "Annual_Temp"), row.names = c(NA, -6L), class = "data.frame")), .Names = c("air_temp.1980", 
"air_temp.1981", "air_temp.1982"))
Frank
  • 66,179
  • 8
  • 96
  • 180
Hamilton
  • 620
  • 2
  • 14
  • 32
  • Fwiw, I think you should make two tables: one with time-invariant ID attributes (lat, long); and another with year-varying variables, with cols ID, Year, Annual_Temp. The reasoning is covered in this paper by the dplyr package's author: https://www.jstatsoft.org/article/view/v059i10 – Frank Apr 04 '18 at 17:50
  • @Frank would it be possible to produce your handy solution for this problem? – Hamilton Apr 04 '18 at 17:51
  • Yeah, I think so. Could you make the question (more easily) reproducible? This looks similar to some built-in datasets you could start from (try `data()`); you might also want to see this Q&A re making a reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 (I'll be afk for a few; won't be able to post an answer soon.) – Frank Apr 04 '18 at 17:54
  • 1
    @Frank perhaps using `dput` could make a reproducible example. Could you share your thoughts? Thank you – Hamilton Apr 04 '18 at 18:02
  • 2
    Jerry, that's a great tool to use for generating a reproducible example. Use it on each of your frames, perhaps as `dput(head(x,n=10))`, and post the output (each should start with `structure(...`). It is common, though, that `head` will not be enough: you need to have enough data in all frames such that the merge will be done correctly and that there is enough variability in this sample data to see the differences in merging. So you may need to choose very specific rows from each frame. (Having 6-12 rows is a good ballpark, but as always it varies.) – r2evans Apr 04 '18 at 19:46
  • I've added what I guess your dput would be; feel free to edit/correct it. – Frank Apr 04 '18 at 22:35

2 Answers2

4

First, combining the tables in long form:

library(data.table)
L = lapply(multiple_DF, data.table)

bigDT = rbindlist(L, id="src")

              src Year         ID long   lat Annual_Temp
 1: air_temp.1980 1980 6.25_51.75 6.25 51.75   10.709091
 2: air_temp.1980 1980 6.25_51.25 6.25 51.25   10.581818
 3: air_temp.1980 1980 6.25_50.75 6.25 50.75    9.500000
 4: air_temp.1980 1980 6.75_51.75 6.75 51.75   10.354545
 5: air_temp.1980 1980 6.75_51.25 6.75 51.25   10.636364
 6: air_temp.1980 1980 6.75_50.75 6.75 50.75    9.872727
 7: air_temp.1981 1981 6.25_51.75 6.25 51.75   10.727273
 8: air_temp.1981 1981 6.25_51.25 6.25 51.25   10.563636
 9: air_temp.1981 1981 6.25_50.75 6.25 50.75    9.654545
10: air_temp.1981 1981 6.75_51.75 6.75 51.75   10.409091
11: air_temp.1981 1981 6.75_51.25 6.75 51.25   10.654545
12: air_temp.1981 1981 6.75_50.75 6.75 50.75    9.954545
13: air_temp.1982 1982 6.25_51.75 6.25 51.75   11.809090
14: air_temp.1982 1982 6.25_51.25 6.25 51.25   11.581820
15: air_temp.1982 1982 6.25_50.75 6.25 50.75   10.618180
16: air_temp.1982 1982 6.75_51.75 6.75 51.75   11.445450
17: air_temp.1982 1982 6.75_51.25 6.75 51.25   11.736360
18: air_temp.1982 1982 6.75_50.75 6.75 50.75   10.854550

Then somewhat "normalizing" the data into multiple tables:

ID_attr = unique(bigDT[, c("ID", "lat", "long")])

           ID   lat long
1: 6.25_51.75 51.75 6.25
2: 6.25_51.25 51.25 6.25
3: 6.25_50.75 50.75 6.25
4: 6.75_51.75 51.75 6.75
5: 6.75_51.25 51.25 6.75
6: 6.75_50.75 50.75 6.75

meas_data = bigDT[, c("Year", "ID", "Annual_Temp")]

    Year         ID Annual_Temp
 1: 1980 6.25_51.75   10.709091
 2: 1980 6.25_51.25   10.581818
 3: 1980 6.25_50.75    9.500000
 4: 1980 6.75_51.75   10.354545
 5: 1980 6.75_51.25   10.636364
 6: 1980 6.75_50.75    9.872727
 7: 1981 6.25_51.75   10.727273
 8: 1981 6.25_51.25   10.563636
 9: 1981 6.25_50.75    9.654545
10: 1981 6.75_51.75   10.409091
11: 1981 6.75_51.25   10.654545
12: 1981 6.75_50.75    9.954545
13: 1982 6.25_51.75   11.809090
14: 1982 6.25_51.25   11.581820
15: 1982 6.25_50.75   10.618180
16: 1982 6.75_51.75   11.445450
17: 1982 6.75_51.25   11.736360
18: 1982 6.75_50.75   10.854550

I think this format will be easier to work with than the wide format the OP requested (where the year is embedded in the string column name). Hadley Wickham's tidy data paper may be a useful reference.

To do this in dplyr, use bind_rows instead of rbindlist; or just do.call(rbind, L) in base R.

Frank
  • 66,179
  • 8
  • 96
  • 180
1

As Frank points out, it would be easier with reproducible data, but I think the following will work:

library(tidyverse)
DF<-do.call("rbind", multiple_DF)
DF$Year<-paste0("Ann_temp_",DF$Year)
DF_final<-spread(DF,Year,Annual_Temp)
TJ83
  • 71
  • 4