3

I have an R dataframe that I scraped from the internet using readHTMLTable() in the XML package. The table looks like the following excerpt with multiple variables/columns for population and year. (Note that the years are not duplicated across columns and represent a unique identifier for population.)

        year1   pop1      year2   pop2     year3   pop3     
1                                                        
2       16XX    4675,0    1900    6453,0    1930   9981,2       
3       17XX    4739,3    1901    6553,5    1931   ...      
4       17XX    4834,0    1902    6684,0    1932   
5       180X    4930,0    1903    6818,0    1933        
6       180X    5029,0    1904    6955,0    1934        
7       181X    5129,0    1905    7094,0    1935
8       181X    5231,9    1906    7234,7    1936
9       182X    5297,0    1907    7329,0    1937
10      182X    5362,0    1908    7422,0    1938

I would like to reorganize the data into just two columns, one for year and one for population that looks like the following:

        year    pop     
1                                                        
2       16XX    4675,0
3       17XX    4739,3  
4       17XX    4834,0  
5       180X    4930,0
6       180X    5029,0  
7       181X    5129,0
8       181X    5231,9  
9       182X    5297,0
10      182X    5362,0  
11      1900    6453,0
12      1901    6553,5
13      1902    6684,0
...     ...     ...
21      1930    9981,2
22      ... 

The values from the variables/columns year2 and year3 are appended below year1, as are the corresponding population values.

I have considered the following:

(1) Looping over the population and year columns (n>2) and adding those values as new observations to year1 and population1 will work, but this seems unnecessarily cumbersome.

(2) I have tried melt as below, but either it cannot handle the id variable split across multiple columns, or I am not implementing it correctly.

df.melt <- melt(df, id=c("year1", "year2",...)

(3) Lastly, I considered pulling out each year column as its own vector, and appending each of those vectors together as here:

year.all <- c(df$year1, df$year2,...)

However, the above returns the following for year.all

[1]  1  2  3  3  4  4  5  5  6  6  7  8  8  9  9  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  1  1  2 ...

rather than this

[1] 16XX 17XX 17XX 180X 180X 181X 181X 182X 182X 1900 1901 1902...

If there is a straightforward way of accomplishing this reorganizing I would love to learn it. Many thanks for the help.

Jaap
  • 81,064
  • 34
  • 182
  • 193

3 Answers3

7

Using the new feature in melt from data.table v1.9.5+:

require(data.table) # v1.9.5+
melt(setDT(df), measure = patterns("^year", "^pop"), value.name = c("year", "pop"))

You can find the rest of the vignettes here.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Thanks so much. Does the function `patterns()` come with the package `data.table`? I'm receiving the following error message: `Error in melt.data.table(setDT(table), measure = patterns("^year", "^population"), : could not find function "patterns"` despite using the following `install.packages(data.table)` `library(data.table)` `require(data.table)`. – user2489854 Jul 19 '15 at 19:59
  • Ah, it is not in v1.9.4. Will update. – user2489854 Jul 19 '15 at 23:06
6

If the 'year', 'pop', columns are alternating, we can subset with c(TRUE, FALSE) to get the columns 1, 3, 5,..etc. and c(FALSE, TRUE) to get 2, 4, 6,.. due to the recycling. Then, we unlist the columns and create a new 'data.frame.

 df2 <- data.frame(year=unlist(df1[c(TRUE, FALSE)]), 
                  pop=unlist(df1[c(FALSE, TRUE)]))
 row.names(df2) <- NULL
 head(df2)
 #   year    pop
 #1            
 #2 16XX 4675,0
 #3 17XX 4739,3
 #4 17XX 4834,0
 #5 180X 4930,0
 #6 180X 5029,0

Or another option is

library(splitstackshape)
merged.stack(transform(df1, id=1:nrow(df1)), var.stubs=c('year', 'pop'), 
        sep='var.stubs')[order(.time_1), 3:4, with=FALSE]

data

df1 <- structure(list(year1 = c("", "16XX", "17XX", "17XX", "180X", 
"180X", "181X", "181X", "182X", "182X"), pop1 = c("", "4675,0", 
"4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", 
"5362,0"), year2 = c(NA, 1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 
1906L, 1907L, 1908L), pop2 = c("", "6453,0", "6553,5", "6684,0", 
"6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"), 
year3 = c(NA, 1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 
1937L, 1938L), pop3 = c("", "9981,2", "", "", "", "", "", 
"", "", "")), .Names = c("year1", "pop1", "year2", "pop2", 
"year3", "pop3"), class = "data.frame", row.names = c(NA, -10L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • The strange thing is that `df2$year` gives me: `1 2 2 3 3 4 4 5 5 1900 1901 1902 1903 1904 1905 1906 1907 1908 1930 1931 1932 1933 1934 1935 1936 1937 1938`, which is not the intended result. – Jaap Jul 19 '15 at 19:14
  • @Jaap I had `character` columns for 'year'. Can you check whether you had 'factor'? – akrun Jul 19 '15 at 19:15
  • Checked, `df$year1` was indeed a factor-variable. – Jaap Jul 19 '15 at 19:17
  • @Jaap I tried `df1[c(1,3,5)] <- lapply(df1[c(1,3,5)], factor)` still I am getting the expected one – akrun Jul 19 '15 at 19:18
  • When I use your data, I get the same result. I think the problem originates from the fact that I removed the first blank line before reading in the data. – Jaap Jul 19 '15 at 19:22
  • 1
    @Jaap In that case, I think the first column alone will be factor, and the rest of the year columns as numeric. So, when place it together it will convert to the numeric levels in factor column. – akrun Jul 19 '15 at 19:26
  • 1
    Now checked all variables, classes were indeed different for the different `year` columns. That certainly explains the odd behavior in my case. Learned again something today :-) – Jaap Jul 19 '15 at 19:41
2

Another option is to use split.default to split the dataframe in a list of dataframes and then bind them together:

lst <- lapply(split.default(df1, sub('.*(\\d)', '\\1', names(df1))),
              setNames, c('year','pop'))

do.call(rbind, lst)

which gives the desired result:

    year     pop
1.1 16XX  4675,0
1.2 17XX  4739,3
1.3 17XX  4834,0
1.4 180X  4930,0
1.5 180X  5029,0
1.6 181X  5129,0
1.7 181X  5231,9
1.8 182X  5297,0
1.9 182X  5362,0
2.1 1900  6453,0
2.2 1901  6553,5
2.3 1902  6684,0
2.4 1903  6818,0
2.5 1904  6955,0
2.6 1905  7094,0
2.7 1906  7234,7
2.8 1907  7329,0
2.9 1908  7422,0
3.1 1930  9981,2
3.2 1931 10583,5
3.3 1932  8671,0
3.4 1933  9118,0
3.5 1934  9625,0
3.6 1935  8097,0
3.7 1936  7984,7
3.8 1937  8729,0
3.9 1938 10462,0

You could also use rbindlist from the data.table package for the last step:

library(data.table)
rbindlist(lst)

Used data:

df1 <- structure(list(year1 = c("16XX", "17XX", "17XX", "180X", "180X", "181X", "181X", "182X", "182X"),
                      pop1 = c("4675,0", "4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", "5362,0"),
                      year2 = c(1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 1906L, 1907L, 1908L),
                      pop2 = c("6453,0", "6553,5", "6684,0", "6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"), 
                      year3 = c(1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 1937L, 1938L),
                      pop3 = c("9981,2", "10583,5", "8671,0", "9118,0", "9625,0", "8097,0", "7984,7", "8729,0", "10462,0")),
                 .Names = c("year1", "pop1", "year2", "pop2", "year3", "pop3"), class = "data.frame", row.names = c(NA, -9L))
Jaap
  • 81,064
  • 34
  • 182
  • 193