I am new to R. I would like to transform a binary matrix like this example:

"   1874 1875 1876 1877 1878 .... 2009  
F     1     0     0     0     0   ...  0
E     1     1     0     0     0   ...  0
D     1     1     0     0     0   ...  0
C     1     1     0     0     0   ...  0
B     1     1     0     0     0   ...  0
A     1     1     0     0     0   ...  0"

Since the column names are years, I would like to aggregate them into decades and obtain something like:

"1840-1849 1850-1859 1860-1869 .... 2000-2009
F     1     0     0     0     0   ...  0
E     1     1     0     0     0   ...  0
D     1     1     0     0     0   ...  0
C     1     1     0     0     0   ...  0
B     1     1     0     0     0   ...  0
A     1     1     0     0     0   ...  0"

I am used to Python and do not know how to do this transformation without writing loops! Thanks, Isabel

Gavin Simpson
  • What is the aggregating function? Sum? Mean? Perhaps make your question [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Chase Mar 28 '13 at 14:05
  • How do you aggregate the years of a decade into one value? E.g. if I have a vector of `c(1,1,1,1,1,0,0,0,0,0)` for the years 1840 - 1849, what is the resulting number? Minimum, Maximum, Mode, Median? – Simon O'Hanlon Mar 28 '13 at 14:06

2 Answers

It is unclear what aggregation you want, but using the following dummy data

set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24

The following counts events in each 10-year period.

Get the years as a numeric variable

years <- as.numeric(names(df))

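Next we need the start year of each decade, floored to a multiple of 10 and extended one decade past the last year so that every column falls inside an interval:

# signif(x, 3) rounds rather than floors (1876 would become 1880), so floor explicitly
ind <- seq(from = floor(years[1] / 10) * 10, to = (floor(tail(years, 1) / 10) + 1) * 10, by = 10)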
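As a quick check on how the year columns map to decades, here is a minimal sketch (not part of the original answer) that floors each year to its decade start with integer division and counts the columns per decade. The name `decade_start` is an assumption for illustration; the year range matches the dummy data above.

```r
# Same range of years as the dummy data: 1874 to 1898
years <- 1874 + 0:24

# Integer division floors each year to the start of its decade
decade_start <- years %/% 10 * 10

# Number of year columns that fall in each decade
per_decade <- table(decade_start)
per_decade
```

The first and last decades are only partly covered (6 and 9 columns), which is worth keeping in mind when comparing counts across decades.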

We then loop over the decade starts with `lapply` (indices `1:(length(ind) - 1)`), select the columns of `df` that fall in the current decade, and count the 1s in each row using `rowSums`.

tmp <- lapply(seq_along(ind[-1]),
              function(i, inds, data) {
                # drop = FALSE keeps a data frame even if a decade has only one column
                rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1), drop = FALSE])
              }, inds = ind, data = df)

Next we cbind the resulting vectors into a data frame and fix-up the column names:

out <- do.call(cbind.data.frame, tmp)
names(out) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out

This gives:

> out
  1870-1879 1880-1889 1890-1899
1         4         5         6
2         4         6         6
3         2         5         5
4         5         5         7
5         3         3         7
6         5         5         4
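As an aside to the `lapply` approach, the same per-decade counts can probably be obtained with base R's `rowsum()`, which sums the rows of a matrix by group. This is only a sketch under stated assumptions: the small hand-made data frame `toy` and the names `decade`, `counts` and `binary` are hypothetical and not part of the original answer.

```r
# Hypothetical toy data: 2 rows, 5 year columns spanning three decades
toy <- data.frame(matrix(c(1, 0,  0, 0,  0, 1,  0, 1,  0, 0), nrow = 2))
names(toy) <- c(1878, 1879, 1880, 1881, 1890)

# Decade start for each column
decade <- as.numeric(names(toy)) %/% 10 * 10

# rowsum() groups rows, so transpose first (years become rows), then transpose back
counts <- t(rowsum(t(as.matrix(toy)), group = decade))
colnames(counts) <- paste(colnames(counts), as.numeric(colnames(counts)) + 9, sep = "-")
counts

# Binary version: 1 if at least one event occurred in the decade, else 0
binary <- (counts > 0) * 1
binary
```

Because `rowsum()` only sums, this variant suits counts and the "any event" indicator; other aggregations still need a function applied per decade.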

If you simply want a binary matrix, with a 1 indicating that at least one event happened in that decade, you can use:

tmp2 <- lapply(seq_along(ind[-1]),
               function(i, inds, data) {
                 # drop = FALSE keeps a data frame even if a decade has only one column
                 as.numeric(rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1), drop = FALSE]) > 0)
               }, inds = ind, data = df)
out2 <- do.call(cbind.data.frame, tmp2)
names(out2) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out2

which gives:

> out2
  1870-1879 1880-1889 1890-1899
1         1         1         1
2         1         1         1
3         1         1         1
4         1         1         1
5         1         1         1
6         1         1         1

If you want a different aggregation, modify the function applied in the `lapply` call to use something other than `rowSums`.
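One way to make the aggregation swappable is to pass it in as an argument. The helper below is a minimal sketch under stated assumptions, not the answer's own code: `aggregate_by_decade`, its parameter `f`, and the data frame `toy` are hypothetical names introduced for illustration, and the column names are assumed to be years.

```r
# Hypothetical helper: apply an arbitrary row-wise function f within each decade
aggregate_by_decade <- function(data, f = sum) {
  years  <- as.numeric(names(data))
  # Column indices grouped by decade start (e.g. 1870, 1880, ...)
  groups <- split(seq_along(years), years %/% 10 * 10)
  cols <- lapply(groups, function(idx) {
    # drop = FALSE keeps a one-column decade as a data frame
    apply(as.matrix(data[, idx, drop = FALSE]), 1, f)
  })
  out <- do.call(cbind, cols)
  colnames(out) <- paste(names(groups), as.numeric(names(groups)) + 9, sep = "-")
  out
}

# Hypothetical toy data: 2 rows, 5 year columns spanning three decades
toy <- data.frame(matrix(c(1, 0,  0, 0,  0, 1,  0, 1,  0, 0), nrow = 2))
names(toy) <- c(1878, 1879, 1880, 1881, 1890)

res_sum <- aggregate_by_decade(toy, sum)  # events per decade
res_max <- aggregate_by_decade(toy, max)  # 1 if any event, for 0/1 data
```

Using `apply(x, 1, f)` is slower than `rowSums` on large data, but it accepts any function that reduces a numeric vector to a single value, such as `mean` or `median`.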

Gavin Simpson
  • Thanks. I want a 1 if the resulting vector (e.g. `c(1,1,1,1,1,0,0,0,0,0)`) contains any 1s. Maybe using `max()`. I will try your suggestion. – user2219894 Mar 28 '13 at 15:07
  • My second example covers that. `as.numeric(rowSums(x) > 0)` is essentially what I use there. `max` only works if you know the data only contains `1`s and `0`s, though `max` could be more efficient than the `rowSums` version, but you would need `apply(x, 1, max)`, i.e. apply `max` to each row. – Gavin Simpson Mar 28 '13 at 15:13
  • @GavinSimpson the values in `out` don't correspond to the sample data you generate. – Matthew Plourde Mar 28 '13 at 16:28
  • @MatthewPlourde The code I showed gives the correct output but when creating the answer I pasted in code from data generated under a different seed. Thanks again for pointing that out - I've now updated the outputs and `out` matches your answer for the given input data. – Gavin Simpson Mar 28 '13 at 16:45

This is another option, using integer division (`%/%`) to group the columns by decade.

# setup, borrowed from @GavinSimpson
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24

result <- do.call(cbind, 
    by(t(df), as.numeric(names(df)) %/% 10 * 10, colSums))

# add -xxx9 to column names, for each decade
dimnames(result)[[2]] <- paste(colnames(result), as.numeric(colnames(result)) + 9, sep='-')

#    1870-1879 1880-1889 1890-1899
# V1         4         5         6
# V2         4         6         6
# V3         2         5         5
# V4         5         5         7
# V5         3         3         7
# V6         5         5         4

If you want to aggregate with something other than a sum, replace the call to `colSums` with something like `function(cols) sapply(cols, f)`, where `f` is the aggregating function, e.g., `max`. (`sapply` returns a plain vector per decade, whereas `lapply` would return a list.)
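For instance, swapping in `max` gives a 0/1 indicator per decade. This is a minimal sketch on a hypothetical toy data frame (`toy`, `decade` and `result_max` are names introduced here); it uses `sapply` rather than `lapply` so that each decade yields a plain numeric vector that `cbind` can combine into a numeric matrix.

```r
# Hypothetical toy data: 2 rows, 5 year columns spanning three decades
toy <- data.frame(matrix(c(1, 0,  0, 0,  0, 1,  0, 1,  0, 0), nrow = 2))
names(toy) <- c(1878, 1879, 1880, 1881, 1890)

# Decade start for each year column
decade <- as.numeric(names(toy)) %/% 10 * 10

# by() splits the transposed data (years as rows) by decade;
# sapply() then applies max to each original row within that decade
result_max <- do.call(cbind,
    by(t(toy), decade, function(cols) sapply(cols, max)))

# add -xxx9 to column names, for each decade
colnames(result_max) <- paste(colnames(result_max),
                              as.numeric(colnames(result_max)) + 9, sep = "-")
result_max
```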

Matthew Plourde