I want to create a look-up table from a dataframe of changes. Each row of the original dataframe signifies a change in the coding for a given district. That dataset covers some time period, 2009 to 2019. While a district may experience several changes during that time period, I want the 2009 and 2019 codings of each district. That is, the first and latest coding.
The dataframe covers hundreds of districts. Some districts may go through just a single change, while others go through multiple. A district can be merged or split into multiple others.
The ideal look-up table would look as follows:
coding_2009 | coding_2019 |
---|---|
00QR | S12000047 |
00QR | S12000048 |
00RB | S12000047 |
00RB | S12000048 |
coding_2009
is the district's coding as of 2009, and coding_2019
is its latest coding as of 2019.
The original dataframe (a subset), where each row shows a change, looks like:
past | new | date |
---|---|---|
00QR | S12000015 | 2009-01-01 |
S12000015 | S12000047 | 2018-02-02 |
S12000015 | S12000048 | 2018-02-02 |
00RB | S12000015 | 2009-01-01 |
S12000024 | S12000047 | 2018-02-02 |
S12000024 | S12000048 | 2018-02-02 |
For each row, past
is the code that is recoded into new
as of date
.
For example, district 00QR
is turned into S12000015
, which is later split into S12000047
and S12000048
.
I've been dealing with this problem for weeks now, trying out different ad hoc versions but none seem to consistently work out. Please note that the code needs to consider that some districts experience just one change, whereas others may experience two or more changes. Districts can also be split or merged, as the example shows.
The ideal answer would use tidyverse
.
For a reprex, I have subset a selection of districts below.
Thanks for your help! It would be appreciated immensely.
Reprex data:
(you can also go above and beyond and use the original dataset, Changes.csv
. See link below)
# Library tibble (a part of tidyverse) is needed to copy paste reprex data
#install.packages("tibble") # if you need to install it
library(tibble)
data <- tibble::tribble(
~past, ~new, ~date,
"00RJ", "S12000013", "2009-01-01",
"00QR", "S12000015", "2009-01-01",
"00RB", "S12000024", "2009-01-01",
"13UD", "E07000015", "2009-01-01",
"15UH", "E07000025", "2009-01-01",
"00HC", "E06000024", "2009-01-01",
"00KG", "E06000034", "2009-01-01",
"19UD", "E07000049", "2009-01-01",
"19UE", "E07000050", "2009-01-01",
"19UG", "E07000051", "2009-01-01",
"19UH", "E07000052", "2009-01-01",
"19UJ", "E07000053", "2009-01-01",
"E07000017", "E06000049", "2009-04-01",
"E07000025", "E06000053", "2009-04-01",
"E07000014", "E06000049", "2009-04-01",
"E07000015", "E06000049", "2009-04-01",
"S12000013", "S12000013", "2015-06-16",
"S12000013", "S12000013", "2015-11-01",
"S12000015", "S12000047", "2018-02-02",
"S12000024", "S12000047", "2018-02-02",
"S12000015", "S12000048", "2018-02-02",
"S12000024", "S12000048", "2018-02-02",
"E07000049", "E06000059", "2019-04-01",
"E07000050", "E06000059", "2019-04-01",
"E07000053", "E06000059", "2019-04-01",
"E07000051", "E06000059", "2019-04-01",
"E07000052", "E06000059", "2019-04-01"
)
# Convert date to Date (after being copy pasted as tibble)
data$date <- as.Date(data$date)
For anyone interested, this data is from the UK's Code History Database
. You can download the zip from the link below. It's the file named Changes.csv
: https://geoportal.statistics.gov.uk/datasets/code-history-database-december-2019-for-the-united-kingdom. Note, in Changes.csv
, past
is named geogcd_p
, new
is geogcd
and date
is oper_date
.