I have a large database, a subset of which looks like this
ID year value1 value2
1 2000 203,305,701 1, 2, 1
1 2001 203,504 1, 1
1 2002 203 1
2 2010 245 3
2 2011 245,332 2, 1
2 2012 332 3
2 2013 332 2
2 2014 245,332 2, 1
Reproducible code:
structure(list(
ID = c("1", "1", "1", "2", "2", "2", "2", "2"),
year = c("2000", "2001", "2002", "2010", "2011", "2012",
"2013", "2014"), value1 = c("203, 305, 701",
"203, 504", "203", "245", "245, 332",
"332", "332", "245, 332"), value2 = c("1, 2, 1",
"1, 1", "1", "3", "2, 1", "3", "2", "2, 1")), class = "data.frame", row.names = c(NA, -8L))
"value1" and "value2" contain comma-separated values. The objective is to simplify the "value1" column to a single value. The algorithm I've thought out goes like this:
- Check for previous and next values for each row while grouping by ID (taking intersections: i.e. the common value in two consecutive rows). For example, for row 5: The intersection of {245, 332} with the previous row {245} for value1 is 245, while with the next row {332} it is 332
- Prefer next value over previous value for selection. I want to prioritise the next value i.e. {332} in this split decision.
- If either intersection does not narrow down to a single value, select value1 based on max(value2). If value2 does not have a maximum, select randomly. The third step does not come into play since a single value is selected based on the first two steps.
The algorithm continues to the next row as soon as a single value is reached. Previous and next refers to the preceding and the following row respectively.
Similarly, for row 1: The intersection is 203 with only the next row, as we stopped the algorithm as soon as we arrived at a single value.
The final data should look like this
ID year value1 value2
1 2000 203 1, 2, 1
1 2001 203 1, 1
1 2002 203 1
2 2010 245 3
2 2011 332 2, 1
2 2012 332 3
2 2013 332 2
2 2014 332 2, 1
I tried writing a basic code in R to loop over each row grouping by "ID" and going through each year since I have no idea which package to use for this and going case by case, but it seems to me that this might not be the most efficient method. (I am also very new to R)