
Edit: Thanks to those who have responded so far; I'm very much a beginner in R and have just taken on a large project for my MSc dissertation, so I'm a bit overwhelmed with the initial processing. The data I'm using is as follows (from publicly available WMO rainfall data):


120 6272100 KHARTOUM 15.60 32.55 382 1899 1989 0.0
1899 0.03 0.03 0.03 0.03 0.03 1.03 13.03 12.03 9999 6.03 0.03 0.03
1900 0.03 0.03 0.03 0.03 0.03 23.03 80.03 47.03 23.03 8.03 0.03 0.03
1901 0.03 0.03 0.03 0.03 0.03 17.03 23.03 17.03 0.03 8.03 0.03 0.03
(...)
120 6272101 JEBEL AULIA 15.20 32.50 380 1920 1988 0.0
1920 0.03 0.03 0.03 0.00 0.03 6.90 20.00 108.80 47.30 1.00 0.01 0.03
1921 0.03 0.03 0.03 0.00 0.03 0.00 88.00 57.00 35.00 18.50 0.01 0.03
1922 0.03 0.03 0.03 0.00 0.03 0.00 87.50 102.30 10.40 15.20 0.01 0.03
(...)

There are ~100 observation stations that I'm interested in, each of which has a varying start and end date for rainfall measurements. They're formatted as above in a single data file, with stations separated by "120 (station number) (station name)".

I need first to separate this file by station, then to extract March, April, May and June for each year, then take a total of these months for each year. So far I'm messing around with loops (as below), but I understand this isn't the right way to go about it and would rather learn some better technique. Thanks again for the help!
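The three steps described above (split by station, pick March to June, total per year) can be sketched in base R without any loops. This is only a rough illustration under some assumptions: every header line starts with "120 " and no data line does, the value 9999 marks a missing measurement, and the names `is_header`, `block`, `parse_block` and `stations` are made up for the example. The sample lines are copied from the excerpt above; a real file would be read with `readLines()`.

```r
# Sample lines copied from the question; for a real file use readLines("yourfile")
lines <- c(
  "120 6272100 KHARTOUM 15.60 32.55 382 1899 1989 0.0",
  "1899 0.03 0.03 0.03 0.03 0.03 1.03 13.03 12.03 9999 6.03 0.03 0.03",
  "1900 0.03 0.03 0.03 0.03 0.03 23.03 80.03 47.03 23.03 8.03 0.03 0.03",
  "120 6272101 JEBEL AULIA 15.20 32.50 380 1920 1988 0.0",
  "1920 0.03 0.03 0.03 0.00 0.03 6.90 20.00 108.80 47.30 1.00 0.01 0.03",
  "1921 0.03 0.03 0.03 0.00 0.03 0.00 88.00 57.00 35.00 18.50 0.01 0.03"
)

# Assumption: header lines start with "120 " and data lines start with a year
is_header <- grepl("^120\\s", lines)
station_id <- sub("^120\\s+(\\S+).*$", "\\1", lines[is_header])

# Each header starts a new block, so a running count of headers labels the blocks
block <- cumsum(is_header)

parse_block <- function(rows) {
  # One numeric row per year: column 1 = year, columns 2..13 = Jan..Dec
  m <- do.call(rbind, lapply(strsplit(trimws(rows), "\\s+"), as.numeric))
  m[m == 9999] <- NA  # assumption: 9999 is the missing-value marker
  # Columns 4:7 are March, April, May and June
  data.frame(year = m[, 1], mamj_total = rowSums(m[, 4:7, drop = FALSE]))
}

# Split the data lines by station block and parse each block
stations <- lapply(split(lines[!is_header], block[!is_header]), parse_block)
names(stations) <- station_id
```

After this, `stations[["6272100"]]` is a small data frame of year and March to June total for Khartoum. A station with no data rows at all would break the naming step, so that case would need a check on a real file.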

(Original question:) I've got a large data set containing rainfall by season for ~100 years over 100+ locations. I'm trying to separate this data into more manageable arrays, and in particular I want to retrieve the sum of the rainfall for March, April, May and June for each station for each year. The following is a simplified version of my code so far:

a <- array(1,dim=c(10,12))
for (i in 1:5) {

  #all data:
  assign(paste("station_",i,sep=""), a)

  #march - june data:
  assign(paste("station_",i,"_mamj",sep=""), a[,4:7])
}

So this gives me station_(i)_mamj, which contains the data for the months I'm interested in for each station. Now I want to sum each row of this array and enter it in a new array called station_(i)_mamj_tot. Simple enough in theory, but I can't work out how to reference station_(i)_mamj so that the value of i varies with each iteration. Any help much appreciated!

Ruari Rhodes
  • This is not a reproducible example as you haven't provided any data. May I suggest you check out this [LINK](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making a reproducible example. – Tyler Rinker May 14 '12 at 17:18
  • As mentioned you should probably really be using lists to solve this problem. If we knew what your data looked like we could probably help you even more. For example from what I can gather what you want to do would probably be more cleanly done using `split` and `lapply`. – Dason May 14 '12 at 18:05
  • If you can make a data frame in a way that each column represents a month of a particular year, then you can just get the sum with the `summary(data.frame)` – Subs May 14 '12 at 18:22
  • @Subs, he wants to aggregate by year, but only calculate mamj totals. This is a job for *Split-Apply-Combine*! See my ddply one-liner below! – smci May 14 '12 at 22:43
  • OP, avoid using loops whenever you can vectorize, that's the power of R. With your permission I would like to retag from *'variables','loops','concatenation'* to *'vectorization','loops','plyr'*? – smci May 14 '12 at 22:46
  • OP, another habit to unlearn when migrating to R is creating loads of intermediate columns, or big-ass temporary arrays with intermediate results. `ddply()` can give you tons of things with a one-liner, and it (correctly) hides its temporaries from you the user, so the code is waaay more legible (/extensible/reusable/...). – smci May 14 '12 at 22:49
  • Hi all, cheers for the help so far; you might have guessed I'm pretty new to R (and programming in general)! Feel free to retag as you see necessary - I'm just working on getting some usable data for you to see what I'm actually trying to achieve, will edit post soon. – Ruari Rhodes May 15 '12 at 09:11
  • @OP you may not have noticed, but 5 days ago I gave an answer that not just answered the spirit of your question but went 1000% beyond it and solved your entire problem in a ddply one-liner, with a ton of things-I-learned-the-hard-way tips on how to adapt your coding style to R idiom, split-apply-combine, how to use NAs properly,... all of which took me 2 hours to code, research and verify... you did notice that, right? – smci May 19 '12 at 01:04

3 Answers


This is totally begging for a dataframe, then it's just this one-liner with power-tools like ddply (amazingly powerful):

tot_mamj <- ddply(rain[rain$month %in% 3:6,-2], 'year', colwise(sum))

giving your aggregate of total for M/A/M/J, by year (sample output from random test data):

   year station_1 station_2 station_3 station_4 station_5 ...
1  1972  8.618960  5.697739 10.083192  9.264512 11.152378 ...
2  1973 18.571748 18.903280 11.832462 18.262272 10.509621 ...
3  1974 22.415201 22.670821 32.850745 31.634717 20.523778 ...
4  1975 16.773286 17.683704 18.259066 14.996550 19.007762 ...
...

Below is perfectly working code. We create a dataframe whose col.names are 'station_n'; also extra columns for year and month (factor, or else integer if you're lazy, see the footnote). Now you can do arbitrary analysis by month or year (using plyr's split-apply-combine paradigm):

require(plyr) # for d*ply, summarise
#require(reshape) # for melt

# Parameterize everything here, it's crucial for testing/debugging
all_years <- c(1970:2011)
nYears <- length(all_years)  
nStations <- 101
# We want station names as vector of chr (as opposed to simple indices)
station_names <- paste ('station_', 1:nStations, sep='')

# Build one row per (year, month) combination: 12 consecutive rows per year.
# (Recycling month=1:12 against rep(all_years,12) would pair each year with only two months.)
rain <- data.frame(
  year=rep(all_years, each=12),
  month=rep(1:12, times=nYears)
)
# Fill in NAs for all data
rain[,station_names] <- as.numeric(NA)
# Make 'month' a factor, to prevent any numerical funny stuff e.g accidentally 'aggregating' it
rain$month <- factor(rain$month)

# For convenience, store the row indices for all years, M/A/M/J
I.mamj <- which(rain$month %in% 3:6)

# Insert made-up seasonal data for M/A/M/J for testing... leave everything else NA intentionally
rain[I.mamj,station_names] <- c(3,5,9,6) * runif(4*nYears*nStations)

# Get our aggregate of MAMJ totals, by year
# The '-2' column index means: "exclude month, to prevent it also getting 'aggregated'"
excludeMonthCol <- -2
tot_mamj <- ddply(rain[rain$month %in% 3:6, excludeMonthCol], 'year', colwise(sum))

# voila!! tot_mamj has one row per year and the columns year, station_1, station_2, ...
# (the values are random test data, so they differ from run to run)

As a footnote, before I converted month from numeric to factor, it was getting silently 'aggregated' (until I put in the '-2': exclude column reference). However, better still is when you make it a factor, it will refuse point-blank to be aggregate'd, and throw an error (which is desirable for debugging):

 ddply(rain[rain$month %in% 3:6, ], 'year', colwise(sum))
Error in Summary.factor(c(3L, 3L, 3L, 3L, 3L, 3L), na.rm = FALSE) : 
  sum not meaningful for factors
smci

For your original question, use get():

i <- 10
var <- paste("test", i, sep="_")
assign(var, 10)
get(var)

As David said, this is probably not the best path to be taking, but it can be useful at times (and IMO the assign/get construct is far better than eval(parse))
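Applied to the loop in the question, the assign/get pairing might look like the sketch below. It reuses the question's dummy array `a` and its `4:7` column choice as-is (that choice matches March to June only when column 1 holds the year, as in the raw file); the local name `mamj` is just for illustration.

```r
# Dummy data and per-station variables, as in the question
a <- array(1, dim = c(10, 12))
for (i in 1:5) {
  assign(paste("station_", i, "_mamj", sep = ""), a[, 4:7])
}

# Look each variable up by its constructed name, then store the row totals
for (i in 1:5) {
  mamj <- get(paste("station_", i, "_mamj", sep = ""))
  assign(paste("station_", i, "_mamj_tot", sep = ""), rowSums(mamj))
}

station_3_mamj_tot  # ten row totals for station 3
```

This works, but every later step has to rebuild the variable names with paste(), which is the main reason a list is usually the more comfortable choice.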

geoffjentry
  • True with regard to `get`. However, when all you're doing is saving one variable per index, it is very silly not to use the built in types such as lists and matrices. – David Robinson May 14 '12 at 17:57
  • Thus "As David said, this is probably not the best path to be taking" :) – geoffjentry May 14 '12 at 18:18
  • `data.frame` has the advantage over array that we can use heterogeneous columns, so we can have columns for both 'year' and 'month' as factors... then we can arbitrarily split-apply-combine by year, month, or any arbitrary subset thereof. Whereas array is really pretty limited for data analysis. – smci May 14 '12 at 21:54
  • The thing is that he (originally) asked how to get values out of arbitrary variable names, not how he should be structuring his question. As I stated in my post, it is pretty clear that he wasn't going about things the right way (See Lumley's fortune() quip about eval(parse())). Answering the (original) Q with data.frames and such does not actually answer the (original) Q. – geoffjentry May 15 '12 at 17:46

Why are you using assign to create variables like station1, station2, station_3_mamj and so on? It would be much easier and more intuitive to store them in a list, like stations[[1]], stations[[2]], stations_mamj[[3]], and such. Then each could be accessed using their index.
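As a rough sketch of the list version, using the question's dummy array and its `4:7` column choice unchanged (the list names `stations`, `stations_mamj` and `stations_mamj_tot` are only illustrative):

```r
# Dummy data as in the question: one 10 x 12 array per station
a <- array(1, dim = c(10, 12))

# One list element per station instead of one variable per station
stations <- lapply(1:5, function(i) a)

# Keep the columns of interest for every station
stations_mamj <- lapply(stations, function(s) s[, 4:7])

# Row totals for every station, with no name pasting needed
stations_mamj_tot <- lapply(stations_mamj, rowSums)

stations_mamj_tot[[3]]  # totals for station 3
```

Because each step is a single lapply() over the list, the index i never has to be glued into a variable name.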

Since it looks like each piece of per-station data you're working with is a matrix of the same size, you could even deal with them as a three-dimensional array.
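A minimal sketch of that idea, assuming every station really does cover the same rows (stations with different start and end years would first need padding with NA), and with `all_stations` and `mamj_tot` as made-up names:

```r
# Dimensions: year x month column x station, filled with dummy values
all_stations <- array(1, dim = c(10, 12, 5))

# Sum over the selected columns, keeping the year and station dimensions
mamj_tot <- apply(all_stations[, 4:7, , drop = FALSE], c(1, 3), sum)

dim(mamj_tot)  # years by stations
```

The result is an ordinary matrix with one row per year and one column per station, so `mamj_tot[, 3]` gives the totals for station 3.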

ETA: Incidentally, if you really want to solve the problem this way, you would do:

eval(parse(text=paste("station", i, "mamj", sep="_")))

But don't: using eval is almost always bad practice, and will make it difficult to do even simple operations on your data.

David Robinson