How to create a look-up table from a dataframe of changes?

Question

I want to create a look-up table from a dataframe of changes. Each row of the original dataframe signifies a change in the coding for a given district. That dataset covers some time period, 2009 to 2019. While a district may experience several changes during that time period, I want the 2009 and 2019 codings of each district. That is, the first and latest coding.

The dataframe covers hundreds of districts. Some districts may go through just a single change, while others go through multiple. A district can be merged or split into multiple others.

The ideal look-up table would look as follows:

coding_2009	coding_2019
00QR	S12000047
00QR	S12000048
00RB	S12000047
00RB	S12000048

coding_2009 is the district's coding as of 2009, and coding_2019 is its latest coding as of 2019.

The original dataframe (a subset), where each row shows a change, looks like:

past	new	date
00QR	S12000015	2009-01-01
S12000015	S12000047	2018-02-02
S12000015	S12000048	2018-02-02
00RB	S12000015	2009-01-01
S12000024	S12000047	2018-02-02
S12000024	S12000048	2018-02-02

For each row, past is the code that is recoded into new as of date.

For example, district 00QR is turned into S12000015, which is later split into S12000047 and S12000048.

I've been dealing with this problem for weeks now, trying out different ad hoc versions but none seem to consistently work out. Please note that the code needs to consider that some districts experience just one change, whereas others may experience two or more changes. Districts can also be split or merged, as the example shows.

The ideal answer would use tidyverse.

For a reprex, I have subset a selection of districts below.

Thanks for your help! It would be appreciated immensely.

Reprex data: (you can also go above and beyond and use the original dataset, Changes.csv. See link below)

# Library tibble (a part of tidyverse) is needed to copy paste reprex data
#install.packages("tibble") # if you need to install it
library(tibble)

data <- tibble::tribble(
        ~past,        ~new,        ~date,
       "00RJ", "S12000013", "2009-01-01",
       "00QR", "S12000015", "2009-01-01",
       "00RB", "S12000024", "2009-01-01",
       "13UD", "E07000015", "2009-01-01",
       "15UH", "E07000025", "2009-01-01",
       "00HC", "E06000024", "2009-01-01",
       "00KG", "E06000034", "2009-01-01",
       "19UD", "E07000049", "2009-01-01",
       "19UE", "E07000050", "2009-01-01",
       "19UG", "E07000051", "2009-01-01",
       "19UH", "E07000052", "2009-01-01",
       "19UJ", "E07000053", "2009-01-01",
  "E07000017", "E06000049", "2009-04-01",
  "E07000025", "E06000053", "2009-04-01",
  "E07000014", "E06000049", "2009-04-01",
  "E07000015", "E06000049", "2009-04-01",
  "S12000013", "S12000013", "2015-06-16",
  "S12000013", "S12000013", "2015-11-01",
  "S12000015", "S12000047", "2018-02-02",
  "S12000024", "S12000047", "2018-02-02",
  "S12000015", "S12000048", "2018-02-02",
  "S12000024", "S12000048", "2018-02-02",
  "E07000049", "E06000059", "2019-04-01",
  "E07000050", "E06000059", "2019-04-01",
  "E07000053", "E06000059", "2019-04-01",
  "E07000051", "E06000059", "2019-04-01",
  "E07000052", "E06000059", "2019-04-01"
  )

# Convert date to Date (after being copy pasted as tibble)
data$date <- as.Date(data$date)

For anyone interested, this data is from the UK's Code History Database. You can download the zip from the link below. It's the file named Changes.csv: https://geoportal.statistics.gov.uk/datasets/code-history-database-december-2019-for-the-united-kingdom. Note, in Changes.csv, past is named geogcd_p, new is geogcd and dateis oper_date.

Sirius · Accepted Answer · 2021-03-31T15:40:41.027

You are essentially looking at a flat packed tree structure. It is easily graphed using the igraph package:


library(igraph)
g <- dat %>% select( past,new ) %>% t %>% c %>% graph
plot( g )

Now from here on there are many wayt to go about it, but it comes down to a depth first or width first approach to the problem.

It is reasonable to assume that we have several small graphs. A bunch of different codes that have gone through a few changes, rather than a select few codes that have gone through many changes.

This suggests a width first approach, and is solvable by joining the data to itself, hopefully, not too many times:


## work with data.table for that extra speed.
setDT(dat)

## remove duplicate entries of same code
dat <- dat[, .(date=max(date)), by=.(past,new) ]

## these are the roots, all `past` values never present in `new`
roots <- dat[ !past %in% new ]

## likewise, the leaves are those that never appear as `past` , unless they are self referencing.
leaves <- unique( dat[ !new %in% past | new == past, !"past" ], by="new" )


dd <- copy(roots)

## sucessively add next step from the source data till we have arrived at leaves only.
while( !all( dd$new %in% leaves$new ) ) {
    dd <- unique(
        merge( dd, dat, by.x="new", by.y="past", all.x=TRUE )[ , .(date.x, past, new=coalesce(new.y,new), date.y=coalesce(date.y,date.x) ) ]
    )
}

## final cleanup
dd[ order(past), .(coding_2009=past,coding_2019=new) ]

Output:


> dd[ order(past), .(coding_2009=past,coding_2019=new) ]
    coding_2009 coding_2019
 1:        00HC   E06000024
 2:        00KG   E06000034
 3:        00QR   S12000047
 4:        00QR   S12000048
 5:        00RB   S12000047
 6:        00RB   S12000048
 7:        00RJ   S12000013
 8:        13UD   E06000049
 9:        15UH   E06000053
10:        19UD   E06000059
11:        19UE   E06000059
12:        19UG   E06000059
13:        19UH   E06000059
14:        19UJ   E06000059
15:   E07000014   E06000049
16:   E07000017   E06000049

Now I have only looked at the mini dataset, so I have no idea how the code will eprform in the wild, but you could give it a go.

Looking at the image above, we see that there are at most 3 steps for each graph from root to leaf, meaning the above while loop only had to run once.

Amazing! And cool use of igraph. It works great on first try - I'll try again tomorrow. Thanks a lot for your help! — Will M, Mar 31 '21 at 01:14

Will M · Answer 2 · 2021-03-31T13:14:37.317

Sirius provided an amazing answer using data.table. Here I translate that answer into tidyverse:

# Remove duplicate entries of same code
data_sub <- data %>%
  group_by(past, new) %>%
  filter(date == max(date)) %>%
  ungroup()

# Create roots: All past values never present in new
roots <- data_sub %>%
  filter(!past %in% new)

# Create leaves: Those that never appear as past, unless they self reference
leaves <- data_sub %>%
  filter(!new %in% past | new == past) %>%
  select(-past) %>%
  distinct(new, .keep_all = TRUE)

# Copy before loop
dd <- roots

# Successively add next step from source data until we have arrived at leaves only
while(!all(dd$new %in% leaves$new)) {
  
  # Join
  dd_merge <- left_join(dd, data_sub, by = c("new" = "past"))
  
  # Coalesce
  dd_sub <- dd_merge %>%
    transmute(date.x,
              past,
              new = coalesce(new.y, new),
              date.y = coalesce(date.y, date.x))
  
  # Take unique
  dd <- unique(dd_sub)
  
}

How to create a look-up table from a dataframe of changes?

2 Answers2