How to merge a particular row of a group of rows with the same reference #, with a df that contains a single row with the matching reference #

Question

I'm still very new at stack overflow so please let me know if there is a better way to include data or other formatting issues with my question. Thanks!

I have 2 data frames. One contains a single row of data that i need which has a unique reference number.

I need to merge the Ph and Dissolved02 from the first data frame into the one with latitude and longitude. But I only want to the values from the last row of each unique reference number, or in other words the deepest pH and Dissolved02 values. The final data frame will only have one occurrence of each reference number. A sample of each data frame can be created with the following code (maybe a much easier way to input data into stack overflow?)...

sample.df <- readLines(textConnection("BBM2008050101    0.2 B   24.8    52.1    8.2 34.3    6.1 
BBM2008050101   1.0 B   24.8    52.4    8.2 34.5    6.1 
BBM2008050101   1.4 B   24.8    52.4    8.2 34.5    6.1 
BBM2008050102   0.2 B   24.5    53.0    8.1 35.0    6.3 
BBM2008050102   1.0 B   24.5    53.0    8.1 34.9    6.0 
BBM2008050102   1.6 B   24.5    53.0    8.1 35.0    5.9 
BBM2008050103   0.2 B   24.9    51.1    8.2 33.5    6.1 
BBM2008050103   1.0 B   24.9    51.1    8.2 33.5    6.1 
BBM2008050103   1.6 B   24.9    51.1    8.2 33.5    6.1 
BBM2008050104   0.2 B   25.1    51.4    8.2 33.8    6.7 
BBM2008050104   1.0 B   25.1    51.4    8.2 33.8    6.5 
BBM2008050104   1.6 B   25.1    51.4    8.2 33.8    6.5 
BBM2008050105   0.2 B   24.9    51.9    8.1 34.1    7.7 
BBM2008050105   1.0 B   24.9    51.9    8.2 34.1    7.9 
BBM2008050106   0.2 B   25.4    51.1    8.3 33.5    7.0 
BBM2008050106   1.0 B   25.4    51.1    8.3 33.5    6.5 
BBM2008050106   2.0 B   25.4    51.1    8.3 33.5    6.5 
BBM2008050106   2.3 B   25.4    51.1    8.3 33.5    6.4 "))

sample.df <- strsplit(sample.df,"[[:space:]]+")

max.len <- max(sapply(sample.df, length))
corrected.list <- lapply(sample.df, function(x) {c(x, rep(NA, max.len - length(x)))})
df <- do.call(rbind, corrected.list)
colnames(df) <- c("Reference",  "Depth",    "Beg_end",  "Temperature",  "Conductivity", "pH",   "Salinity", "DissolvedO2")
df <- as.data.frame(df)



sample.df2 <- readLines(textConnection("BBM2008050101   301 -83.44165   29.637633   1.6 D
BBM2008050102   301 -83.439717  29.630233   1.8 D
BBM2008050103   301 -83.434017  29.605567   1.8 D
BBM2008050104   301 -83.440067  29.596267   1.8 D
BBM2008050105   301 -83.4346    29.592667   1.2 D
BBM2008050106   300 -83.44555   29.596917   2.5 D"))

sample.df2 <- strsplit(sample.df2,"[[:space:]]+")

max.len2 <- max(sapply(sample.df2, length))
corrected.list2 <- lapply(sample.df2, function(x) {c(x, rep(NA, max.len2 - length(x)))})
df2 <- do.call(rbind, corrected.list2)
colnames(df2) <- c("Reference", "Gear", "Longitude",    "Latitude", "StartDepth",   "Zone")
df2 <- as.data.frame(df2)

Output would be sample.df3 with the deepest Ph and Dissolved02 columns added. Like below but obviously my data frame is much larger and I cannot do this manually.


sample.df3 <- readLines(textConnection("BBM2008050101   301 -83.44165   29.637633   1.6 D  8.2  6.1
BBM2008050102   301 -83.439717  29.630233   1.8 D  8.1  5.9
BBM2008050103   301 -83.434017  29.605567   1.8 D  8.2  6.1
BBM2008050104   301 -83.440067  29.596267   1.8 D  8.2  6.5
BBM2008050105   301 -83.4346    29.592667   1.2 D  8.2  7.9
BBM2008050106   300 -83.44555   29.596917   2.5 D  8.3  6.4"))

sample.df3 <- strsplit(sample.df3,"[[:space:]]+")

max.len3 <- max(sapply(sample.df3, length))
corrected.list3 <- lapply(sample.df3, function(x) {c(x, rep(NA, max.len3 - length(x)))})
df3 <- do.call(rbind, corrected.list3)
colnames(df3) <- c("Reference", "Gear", "Longitude",    "Latitude", "StartDepth",   "Zone", "pH", "Dissolved02")
df3 <- as.data.frame(df3)

If you're looking for an "easier way to input data into Stack Overflow", I suggest posting the output of `dput(your.data.frame)`. The output is in a form that can be copy/pasted into the terminal to replicate the source data frame. If you want to send a snapshot of your data, you can use `head(your.data.frame, num.first.rows)`, or (if you're a baller), take a sample of the data with `sample()`, so use: `dput(your.df[sample[nrow(your.df), num.rows),]`. — chemdork123, Apr 16 '20 at 18:12
Very helpful information to know. Thank you. How do you learn all these simple little functions like dput() when there are so sooo many? — Johnny5ish, Apr 16 '20 at 18:21
The use of `dput` and some other techniques are mentioned in fairly-popular posts like https://stackoverflow.com/q/5963269 as well as SO-basics like [mcve] and https://stackoverflow.com/tags/r/info. — r2evans, Apr 16 '20 at 18:22
I didn't know about `dput` until started looking around and posting in SO. Seems useful! You are probably already familiar with `head()`, but I find using `df[sample(...),]` is better if you're looking to take a random sample of a larger dataset, where just taking the first 20 rows would not represent multiple factors, etc. — chemdork123, Apr 16 '20 at 18:24
I agree, sometimes when all the factors are grouped up, if you just take head() you might get a very similar sample. — Johnny5ish, Apr 16 '20 at 18:34

score 1 · Accepted Answer · answered Apr 16 '20 at 20:00

The below uses dplyr's group_by and summarise to get the last row where a Reference occurs, then filters DF1 on the last rows for each Reference and finally merges in all columns from DF2

library(dplyr)

df$id <- c(1:nrow(df)) # Create ID Column to store row number

# Create a smaller df with just the references and the max row number (which should equal the last occurance)
df1_last_references <- df %>%
  group_by(Reference) %>%
  summarise(id = max(id))

# Filter Original DF on the row numbers matching from the last references
df <- df[which(df$id %in% df1_last_references$id), ]

# merge in the columns from DF2
df3 <- merge(df, df2, all.x = T, by = 'Reference')

head(df3)
      Reference Gear  Longitude  Latitude StartDepth Zone  pH Dissolved02
1 BBM2008050101  301  -83.44165 29.637633        1.6    D 8.2         6.1
2 BBM2008050102  301 -83.439717 29.630233        1.8    D 8.1         5.9
3 BBM2008050103  301 -83.434017 29.605567        1.8    D 8.2         6.1
4 BBM2008050104  301 -83.440067 29.596267        1.8    D 8.2         6.5
5 BBM2008050105  301   -83.4346 29.592667        1.2    D 8.2         7.9
6 BBM2008050106  300  -83.44555 29.596917        2.5    D 8.3         6.4

score 1 · Answer 2 · answered Apr 17 '20 at 01:06

An option using data.table:

DT2[, c("pH", "Dissolved02") := 
    DT1[.SD, on=.(Reference), mult="last", .(pH, DissolvedO2)]
]

output (DT2):

       Reference Gear Longitude Latitude StartDepth Zone  pH Dissolved02
1: BBM2008050101  301 -83.44165 29.63763        1.6    D 8.2         6.1
2: BBM2008050102  301 -83.43972 29.63023        1.8    D 8.1         5.9
3: BBM2008050103  301 -83.43402 29.60557        1.8    D 8.2         6.1
4: BBM2008050104  301 -83.44007 29.59627        1.8    D 8.2         6.5
5: BBM2008050105  301 -83.43460 29.59267        1.2    D 8.2         7.9
6: BBM2008050106  300 -83.44555 29.59692        2.5    D 8.3         6.4

data:

library(data.table)
DT1 <- fread("Reference Depth Beg_end Temperature Conductivity pH Salinity DissolvedO2
BBM2008050101    0.2 B   24.8    52.1    8.2 34.3    6.1 
BBM2008050101   1.0 B   24.8    52.4    8.2 34.5    6.1 
BBM2008050101   1.4 B   24.8    52.4    8.2 34.5    6.1 
BBM2008050102   0.2 B   24.5    53.0    8.1 35.0    6.3 
BBM2008050102   1.0 B   24.5    53.0    8.1 34.9    6.0 
BBM2008050102   1.6 B   24.5    53.0    8.1 35.0    5.9 
BBM2008050103   0.2 B   24.9    51.1    8.2 33.5    6.1 
BBM2008050103   1.0 B   24.9    51.1    8.2 33.5    6.1 
BBM2008050103   1.6 B   24.9    51.1    8.2 33.5    6.1 
BBM2008050104   0.2 B   25.1    51.4    8.2 33.8    6.7 
BBM2008050104   1.0 B   25.1    51.4    8.2 33.8    6.5 
BBM2008050104   1.6 B   25.1    51.4    8.2 33.8    6.5 
BBM2008050105   0.2 B   24.9    51.9    8.1 34.1    7.7 
BBM2008050105   1.0 B   24.9    51.9    8.2 34.1    7.9 
BBM2008050106   0.2 B   25.4    51.1    8.3 33.5    7.0 
BBM2008050106   1.0 B   25.4    51.1    8.3 33.5    6.5 
BBM2008050106   2.0 B   25.4    51.1    8.3 33.5    6.5 
BBM2008050106   2.3 B   25.4    51.1    8.3 33.5    6.4")

DT2 <- fread("Reference Gear Longitude Latitude StartDepth Zone
BBM2008050101   301 -83.44165   29.637633   1.6 D
BBM2008050102   301 -83.439717  29.630233   1.8 D
BBM2008050103   301 -83.434017  29.605567   1.8 D
BBM2008050104   301 -83.440067  29.596267   1.8 D
BBM2008050105   301 -83.4346    29.592667   1.2 D
BBM2008050106   300 -83.44555   29.596917   2.5 D")

I tried `physical[, c("pH", "Dissolved02") := rawhog[.SD, on=.(Reference), mult="last", .(pH, DissolvedO2)] ]` but nothing happened. No data table combined. — Johnny5ish, Apr 23 '20 at 02:35
There will not be a new data.table. It updates the ‘physical’ data.table by reference — chinsoon12, Apr 23 '20 at 03:58
I tried switching the data frames, and either way, my data frame still have the same variables, nothing new added. — Johnny5ish, Apr 24 '20 at 01:31
Was there an error message? You will need to convert the 2 data.frames to data.tables using setDT — chinsoon12, Apr 24 '20 at 05:10
No errors, it appeared to run fine, and they were data tables, just nothing changed. odd. It's okay the other way worked for me. Thank you! — Johnny5ish, Apr 24 '20 at 20:09

How to merge a particular row of a group of rows with the same reference #, with a df that contains a single row with the matching reference #

2 Answers2