0

Sorry for gibberish title, I'm bad at being succinct when I don't understand something fully. The Access database this originates from is broken up into tables keyed by IDs unique in one table, but non-unique in others, in this case the parent contains all of the surveys, each with a unique id, and the child table contains all of the sample data for each survey (roughly 10 samples per survey) but none of the survey data such as region or date. This leads me to the problem, being that I want to be able to attribute data from the survey to each sample for analysis. I've tried Samples$Year <- Surveys$Year[Samples$SurveyID %in% Surveys$SurveyID] which doesn't work due to the unequal length of the frames, and I've tried a very long roundabout and slow nested loop with if/then statements, but there has to be a better way I'm not familiar with as I am fairly new to R and self taught. What would the best way to accomplish either adding the correct column to the child dataframe, or pulling the relevant associated data when needed for each row?

What I have is something like

Surveys:
    SurveyID    Date         Region
1    10         12-MAR-2020  A
2    11         16-APR-2020  B
3    12         20-APR-2020  C

Samples:
  SurveyID   SampleID   Count    
1 10         123        23
2 10         124        54
3 11         125        65
4 11         126        0
5 11         127        3
6 12         128        43

and I want to ideally join them to create something like

Samples:
  SurveyID   SampleID   Count    Date        Region
1 10         123        23       12-MAR-2020 A
2 10         124        54       12-MAR-2020 A
3 11         125        65       16-APR-2020 B
4 11         126        0        16-APR-2020 B
5 11         127        3        16-APR-2020 B
6 12         128        43       20-APR-2020 C

I've tried searching for similar problems, and I'm either using the wrong keywords due to inexperience, barking up the wrong tree, or haven't come across someone in a similar predicament. It may also be worth noting that I did not design the database, and although I do have some SQL experience it's not much to shake a stick at.

  • 1
    you need to make this question reproducible by including your data (use the `dput(Samples)` and include the output in yur question) also include the desired output and and code you have tried. ALso what exactly do you mean by parent/child tables? – morgan121 Apr 29 '20 at 00:26
  • Oh shoot, will do, thanks! – Moss_and_Bones Apr 29 '20 at 00:27
  • 1
    looks like you want a join. try `merge(Samples, Surveys, all.x = T)` – morgan121 Apr 29 '20 at 00:37
  • I'm not entirely sure if it's the right terminology, but by parent/child I mean that the data is hierarchical, each entry in the "parent" table contains data relevant to a subset of entries in the "child" table, keyed by the parentID. It's a byproduct of the structure of the MS Access database that the data originated from, I had to export a lot of data to a flash drive in Excel workbooks when our offices shut down to be able to work from home. – Moss_and_Bones Apr 29 '20 at 00:40
  • Thanks! I'll give that a try. If I wanted to join with only specific columns would it be best to use `select()` from the `dplyr` package? – Moss_and_Bones Apr 29 '20 at 00:43
  • 1
    This looks like a join, if you want to use `dplyr` you can use `inner_join(Surveys, Samples, by = 'SurveyID')` OR even `left_join` in this case works the same. – Ronak Shah Apr 29 '20 at 00:49

1 Answers1

0

This appears to be a text-book "merge" or "join", as @RAB suggested. Using base R:

merge(Surveys, Samples, by = "SurveyID", all = TRUE)
#   SurveyID        Date Region SampleID Count
# 1       10 12-MAR-2020      A      123    23
# 2       10 12-MAR-2020      A      124    54
# 3       11 16-APR-2020      B      125    65
# 4       11 16-APR-2020      B      126     0
# 5       11 16-APR-2020      B      127     3
# 6       12 20-APR-2020      C      128    43

If you're using or learning anything from the tidyverse, then this renders the same results.

library(dplyr)

full_join(Surveys, Samples, by = "SurveyID")

Similarly, if speed and/or data size is a concern, then perhaps you're learning data.table, in which case this is the same, in its more terse language:

library(data.table)
SurveysDT <- as.data.table(Surveys)
SamplesDT <- as.data.table(Samples)

SurveysDT[SamplesDT, on = .(SurveyID)]

Data:

Surveys <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
    SurveyID    Date         Region
1    10         12-MAR-2020  A
2    11         16-APR-2020  B
3    12         20-APR-2020  C")
Samples <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
  SurveyID   SampleID   Count    
1 10         123        23
2 10         124        54
3 11         125        65
4 11         126        0
5 11         127        3
6 12         128        43")
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thank you so much! That should do the trick perfectly. Like a lot of things with R, it seems like the best answer is often staring me in the face! – Moss_and_Bones Apr 29 '20 at 00:52
  • I find the "merge" concept to be difficult to pickup for some, and once mastered, such a powerful mechanism. "Staring you in the face" is relative ... I've been using R solidly for almost a decade, and I still learn new (to me) functions frequently, most of which have been there the whole time. – r2evans Apr 29 '20 at 00:58