0

I have a large database, a subset of which looks like this

ID    year        value1      value2
 1    2000   203,305,701     1, 2, 1
 1    2001       203,504        1, 1
 1    2002           203           1
 2    2010           245           3
 2    2011       245,332        2, 1
 2    2012           332           3
 2    2013           332           2
 2    2014       245,332        2, 1

Reproducible code:

structure(list( 
ID = c("1", "1", "1", "2", "2", "2", "2", "2"), 
year = c("2000", "2001", "2002", "2010", "2011", "2012", 
"2013", "2014"), value1 = c("203, 305, 701", 
"203, 504", "203", "245", "245, 332", 
"332", "332", "245, 332"), value2 = c("1, 2, 1", 
"1, 1", "1", "3", "2, 1", "3", "2", "2, 1")), class = "data.frame", row.names = c(NA, -8L))

"value1" and "value2" contain comma-separated values. The objective is to simplify the "value1" column to a single value. The algorithm I've thought out goes like this:

  1. Check for previous and next values for each row while grouping by ID (taking intersections: i.e. the common value in two consecutive rows). For example, for row 5: The intersection of {245, 332} with the previous row {245} for value1 is 245, while with the next row {332} it is 332
  2. Prefer next value over previous value for selection. I want to prioritise the next value i.e. {332} in this split decision.
  3. If either intersection does not narrow down to a single value, select value1 based on max(value2). If value2 does not have a maximum, select randomly. The third step does not come into play since a single value is selected based on the first two steps.

The algorithm continues to the next row as soon as a single value is reached. Previous and next refers to the preceding and the following row respectively.

Similarly, for row 1: The intersection is 203 with only the next row, as we stopped the algorithm as soon as we arrived at a single value.

The final data should look like this

ID    year        value1      value2
 1    2000           203     1, 2, 1
 1    2001           203        1, 1
 1    2002           203           1
 2    2010           245           3
 2    2011           332        2, 1
 2    2012           332           3
 2    2013           332           2
 2    2014           332        2, 1

I tried writing a basic code in R to loop over each row grouping by "ID" and going through each year since I have no idea which package to use for this and going case by case, but it seems to me that this might not be the most efficient method. (I am also very new to R)

shinama99
  • 51
  • 6
  • 1
    Can you please provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) (**in code form** i.e. output from `dput` (use `head` if data is too large). It's crucial what type `value1` and `value2` are (lists, strings, etc). Adding a MRE and an example of the desired output (in code form, not tables and pictures) makes it much easier for others to find and test an answer to your question. That way you can help others to help you! – dario Oct 18 '21 at 13:32
  • Additionally could you explain your algorithm in more detail? _Check for previous and next values for each row (taking subsets)_ How does this work for your first or your fifth row? – Martin Gal Oct 18 '21 at 13:37
  • Thanks for the suggestions, have made some edits. Hope it helps now – shinama99 Oct 18 '21 at 14:40
  • I'm still not sure if I understand what you want. For one, what do you mean by *" (taking subsets)"* (Do you mean *grouped by ID* ? And the answer to the question how to deal with the first and last row (i.e. what is *previous" and *next* is not clear to me yet... – dario Oct 18 '21 at 15:13
  • Ah, sorry, just realised I used the wrong terminology. I meant the _intersection_, not subsets. So for row 1 {203,305,701}, the next value is row 2 {203,504}, the intersection of which is {203}. By next and previous I mean the following and preceding rows. – shinama99 Oct 18 '21 at 15:36
  • *"By next and previous I mean the following and preceding rows."* - I understood as much ;) What is still unclear is what is the previous row for the first line (and next for last line respectively). Just because the "algorithm" stopped for the example data doesn't mean it will always stop. I have a feeling that you should just go with your loop solution that works for you. I still don't understand the logic behind all this. For example: What guarantees the ordering? Can we really ignore ID, Year and group ? And if so, why are they in the **minimal** reproducible example? – dario Oct 18 '21 at 15:44
  • Also, I saw that you *just now* added the constraint about grouping by ID and Year... But now there is only one row per group - it is now even more unc lear to me what you want... (To clarify: If we group the example by ID, year and group we only have one row per grouping: There is no next or previous row. If we **sort** by ID, year and group however, that would be different. But as it is now, I don't get it. ; – dario Oct 18 '21 at 15:46
  • Okay, so I removed group all together to simplify things, I will merge it later (It is the same for a single ID). I hope it's easier to understand now - just grouping by ID and year. – shinama99 Oct 18 '21 at 15:53
  • Ok, last comment from me: If we group the example data by ID and year; there is only one row per group ( = no previous and next, and therefore no intersection). – dario Oct 18 '21 at 16:16
  • Grouping just by ID, my brain isn't working today. Sorry for all the confusion – shinama99 Oct 18 '21 at 16:18

0 Answers0