I have an R dataframe that I scraped from the internet using readHTMLTable()
in the XML
package. The table looks like the following excerpt with multiple variables/columns for population and year. (Note that the years are not duplicated across columns and represent a unique identifier for population.)
year1 pop1 year2 pop2 year3 pop3
1
2 16XX 4675,0 1900 6453,0 1930 9981,2
3 17XX 4739,3 1901 6553,5 1931 ...
4 17XX 4834,0 1902 6684,0 1932
5 180X 4930,0 1903 6818,0 1933
6 180X 5029,0 1904 6955,0 1934
7 181X 5129,0 1905 7094,0 1935
8 181X 5231,9 1906 7234,7 1936
9 182X 5297,0 1907 7329,0 1937
10 182X 5362,0 1908 7422,0 1938
I would like to reorganize the data into just two columns, one for year and one for population that looks like the following:
year pop
1
2 16XX 4675,0
3 17XX 4739,3
4 17XX 4834,0
5 180X 4930,0
6 180X 5029,0
7 181X 5129,0
8 181X 5231,9
9 182X 5297,0
10 182X 5362,0
11 1900 6453,0
12 1901 6553,5
13 1902 6684,0
... ... ...
21 1930 9981,2
22 ...
The values from the variables/columns year2
and year3
are appended below year1
, as are the corresponding population values.
I have considered the following:
(1) Looping over the population and year columns (n>2
) and adding those values as new observations to year1 and population1 will work, but this seems unnecessarily cumbersome.
(2) I have tried melt as below, but either it cannot handle the id variable split across multiple columns, or I am not implementing it correctly.
df.melt <- melt(df, id=c("year1", "year2",...)
(3) Lastly, I considered pulling out each year column as its own vector, and appending each of those vectors together as here:
year.all <- c(df$year1, df$year2,...)
However, the above returns the following for year.all
[1] 1 2 3 3 4 4 5 5 6 6 7 8 8 9 9 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 1 2 ...
rather than this
[1] 16XX 17XX 17XX 180X 180X 181X 181X 182X 182X 1900 1901 1902...
If there is a straightforward way of accomplishing this reorganizing I would love to learn it. Many thanks for the help.