How do I merge two large data.frames and take a select portion of these values?

Question

specdata <- list.files(getwd(), pattern="*.csv")
directory <- lapply(specdata, read.csv)
directory_final <- do.call(rbind, directory)
library(tidyverse)
one <- select(directory_final, nitrate, ID)
two <- na.omit(one)
a <- select(directory_final, sulfate, ID)
b <- na.omit(a)
two_df <- mutate(two, id = rownames(two))
b_df <- mutate(b, id = rownames(b))
library(plyr)
alpha <- join(two_df, b_df, by = "id", match = "all")
alpha$id <- NULL

dput(head(alpha, 5))
structure(list(sulfate = c(7.21, 5.99, 4.68, 3.47, 2.42), ID = c(1L, 
1L, 1L, 1L, 1L), nitrate = c(0.651, 0.428, 1.04, 0.363, 0.507
), ID = c(1L, 1L, 1L, 1L, 1L)), row.names = c(NA, 5L), class = "data.frame")

dim(alpha)
118783 4

Think of it like this: I have two long strings, one 10 m long and the other 12 m. One string is red and the other blue. Both strings have knots at 0.05 cm intervals along their entire length. Every 10 knots, I give each individual knot an ID: ID-1 for red, ID1-1 for blue, and so forth. I hold one string in each hand, but I want the two strings to be one long string, merged side by side, so I tie them together at the top and at the end. Now, if I want an individual knot from ID-1, 1/10 of the length of the ID-1 string, I untie the first, and so forth. I want a function that lets me find the mean of every knot I untie, either from ID-1 ranging over 1:332 or from ID1-1 ranging over 1:332.

I want something like

alpha_function(nitrate, ID = 1:50)
alpha_function(sulfate, ID = 1:50)

A function that can gather all the mean values of nitrate or sulfate by ID
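One possible shape for such a function, as a minimal sketch. It assumes a data frame with a value column and an `ID` column (here a small made-up `df`, not the real data), keeps the rows whose ID is in the requested range, and takes the mean while skipping `NA`. The name `alpha_function` comes from the question; the arguments `data`, `column`, and `id` are assumptions for illustration.

```r
# Hypothetical sample data standing in for one of the real data frames
df <- data.frame(
  nitrate = c(0.651, 0.428, NA, 0.363, 0.507, 1.04),
  ID      = c(1L, 1L, 1L, 2L, 2L, 3L)
)

# Mean of `column` over all rows whose ID is in `id`, ignoring NA values.
# `data`, `column`, and `id` are illustrative argument names.
alpha_function <- function(data, column, id = 1:332) {
  rows <- data$ID %in% id
  mean(data[[column]][rows], na.rm = TRUE)
}

# One overall mean for IDs 1 and 2
overall <- alpha_function(df, "nitrate", id = 1:2)

# Per-ID means for the same selection, using base R's tapply()
keep <- df$ID %in% 1:2
per_id <- tapply(df$nitrate[keep], df$ID[keep], mean, na.rm = TRUE)
```

Because each measurement carries its own ID column, a function like this could be applied to the nitrate and sulfate data frames separately, which may remove the need to merge them at all.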

Also, when I use the 'join' function, I can only take the mean of the column from the first data.frame (b_df) that I pass to it, whereas the column from the second always returns NA.

mean(alpha$sulfate)
3.189369

mean(alpha$nitrate)
NA

I would also like to know why this happens and how it can be fixed so that both means can be computed.
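The `NA` result is consistent with how `mean()` treats missing values: if any element is `NA`, the result is `NA` unless `na.rm = TRUE` is passed. A join that keeps unmatched rows fills the other table's columns with `NA` for those rows, which would produce exactly this symptom. A minimal sketch with made-up vectors (not the real data):

```r
sulfate <- c(7.21, 5.99, 4.68)
nitrate <- c(0.651, 0.428, NA)   # NA as left by an unmatched row in a join

mean_default <- mean(nitrate)              # NA, because one value is missing
mean_narm <- mean(nitrate, na.rm = TRUE)   # mean of the non-missing values only
n_missing <- sum(is.na(nitrate))           # how many values were skipped
```

Counting the missing values first, as in the last line, is a cheap way to confirm that `NA` entries are the cause.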

cbind wouldn't sync the ID's, if that is required. If you need the ID's to be the key for the join you should specify it using the `on =...` in the merge — Chris Littler, Jul 08 '19 at 12:05
Based on the example above joining by ID is not a requirement, as both datasets have multiple rows for ID = 1, but after the merge there's no evidence of getting more rows for ID = 1... — AntoniosK, Jul 08 '19 at 12:07
Can you create hypothetical data frames like yours under your post? — DSA, Jul 08 '19 at 12:12
@AntoniosK It does not work as there are a different number of rows. — Lime, Jul 08 '19 at 12:41
@ChrisLittler I have applied this, the code does not stop. Do not know whether it will work — Lime, Jul 08 '19 at 12:42
@Emil I don't think it's clear what you're trying to do here. Can you create a smaller version of your datasets and show the actual output you expect to get? Maybe get `df1` with 4 rows (2 with ID = 1 and 2 with ID = 2) and get `df2` with 5 rows (3 with ID = 1 and 2 with ID = 2) and show what the output should be. — AntoniosK, Jul 08 '19 at 12:45
@AntoniosK I have only recently started, however, I can try giving that a go. All I want is for the two columns "nitrate" and "ID", along with their rows to merge into one data.frame that contains also the "sulfate" and "ID1" columns. say, Nitrate and ID are two columns with the same rows corresponding one another at 110,000 rows, and sulfate and ID1 are the same rows corresponding one another at 120,000 rows. How do I merge these two individual data sets so that the columns and rows are in one data.frame. So, col = nitrate, ID, sulfate, ID1, rows = 110,000, 110,000, 120,000, 120,000. — Lime, Jul 08 '19 at 12:51
And the next part; how would I then extract a proportion of data from this dataframe? For example, I create a function that would select from the ID integer values the proportion of numerical values belonging to nitrate, or from the ID1 integer values the proportion of numerical values belonging to sulfate. And then I can apply the mean function to this function? — Lime, Jul 08 '19 at 12:55
@Emil, I can't understand what you're trying to do, so if you create some made-up data frames which imitate your real data, it would be more helpful for the readers who want to help you. — DSA, Jul 08 '19 at 12:57
@DSA Not my best out, however, the code itself should give an idea. ```> tribble( + ~row, ~nitrate, ~ID, ~sulfate, ~ID1, + "one", 0.99, 1, 0.52, 3, + "two", 0.2, 2, 0.56, 5, + "three", 0.5, 100, 0.58, 101, + "four", 0,4, 250, 0.59, 252, + "five", 0.6, 330, 0.51, 329, + "six", 0.8, 332, 0.53, + ) ``` The output is not tidy, I know not how to do that. However, that is the general idea. ID repeats itself, 1, 1, 1, 1, .... 2, 2, 2, ..... 3, 3,3, ..... 331, 331, 332, 332, 332. Which I could not do. Bearing in mind that sulfate+ID1 have longer a row — Lime, Jul 08 '19 at 13:16
See [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/4996248) for making a reproducible example. It is still hard to see just what you are trying to do. The edit shows what the output looks like, but not what the input looks like, especially in the case where the numbers of rows differ (where even the output isn't clear). Perhaps you can just extend the shorter of the two dataframes by `NA` for the missing rows, and then just use `cbind()`? — John Coleman, Jul 08 '19 at 14:37
Also, when I run the `tibble` code in the edit, I get an error message: `Error in eval_tidy(xs[[i]], unique_output) : object 'ID' not found`, so even your intended output isn't reproducible. — John Coleman, Jul 08 '19 at 14:39
@JohnColeman okay. I have four sticks. 2/4 of these sticks are glued together, of the first 2/4, both sticks are 1m of length each. and of the other set of 2/4 sticks, they are of 1.5m length each. How do I then glue these two sets of 2/4 sticks together? Making one set of sticks, 2/4 at 1m length, and 2/4 at 1.5m length. one set of data.frame with 2 columns and a longer number of rows, and another data frame with 2 columns with a shorter number of rows, combined or merged into one data frame. I have tried ```merge(two, b, on = ID, all = TRUE)``` fails to work along with cbind — Lime, Jul 08 '19 at 14:50
@JohnColeman oh right yes, I had forgotten to paste my variable ID. It as something like ```ID1 <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4 )``` and ```ID <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4 )``` thank you for noticing. — Lime, Jul 08 '19 at 14:52
I vaguely understand what you are trying to do, but an analogy with sticks is not a reproducible example, one that covers both input and desired output. The problem with your edit is that the output it shows seems like it could be made with a naive `cbind`, which you say doesn't work. — John Coleman, Jul 08 '19 at 14:53
In any event, [this question](https://stackoverflow.com/q/19074163/4996248) might help — John Coleman, Jul 08 '19 at 14:58
@JohnColeman I have tried cbind, it always says the two are not of the same length (rows). The analogy is to briefly indicate how cbind does not work. The problem with merge, having already tried it, it has only returned the two merging into 2 columns extending both nitrate and ID and not reproducing 4 columns (any other time; it loops and does not stop, so I avoid it). Also, How would I go about extending the NA's nitrate so it matches the same rows as sulfate? (this should work after with cbind) would this then affect my means if there are NA values? — Lime, Jul 08 '19 at 14:58
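On the `merge()` attempts discussed in these comments: base R's `merge()` takes its key through `by` (or `by.x`/`by.y`). It has no `on` argument, so `on = ID` is silently absorbed by `...` and the merge falls back to all shared column names. When the key is a repeated `ID`, every row for an ID in one table is paired with every row for that ID in the other, so the result can become enormous, which may be why the call appears never to finish. A small sketch with made-up data frames `df1` and `df2`:

```r
df1 <- data.frame(nitrate = c(0.99, 0.20, 0.50, 0.40), ID = c(1, 1, 2, 2))
df2 <- data.frame(sulfate = c(0.52, 0.56, 0.58, 0.59, 0.51), ID = c(1, 1, 1, 2, 2))

# Key given explicitly with `by`; duplicated IDs give a many-to-many match
m <- merge(df1, df2, by = "ID", all = TRUE)
n_rows <- nrow(m)  # 2 * 3 rows for ID 1 plus 2 * 2 rows for ID 2

# `on` is not a merge() argument: it is ignored, and merge() uses the
# shared column name (ID) as the key, giving the same result
m_on <- merge(df1, df2, on = "ID", all = TRUE)
```

With hundreds of rows per ID on each side, the same multiplication applies per ID, so a row-number key or a side-by-side `cbind` is the better fit when the rows are meant to line up by position.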

score 0 · Answer 1 · answered Jul 08 '19 at 15:21

The following function might help:

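combine.df <- function(df1, df2) {
  # Pad the shorter data frame with NA rows up to the longer one's length
  n <- max(nrow(df1), nrow(df2))
  # drop = FALSE keeps single-column data frames as data frames
  cbind(df1[1:n, , drop = FALSE], df2[1:n, , drop = FALSE])
}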

The logic of the function is that R automatically inserts NA when you give it indices which are out of range.
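A minimal sketch of that indexing behaviour, using a made-up two-row data frame:

```r
df <- data.frame(nitrate = c(0.651, 0.428), ID = c(1L, 1L))

# Rows 3 and 4 do not exist, so R returns rows filled with NA
padded <- df[1:4, ]
n_padded <- nrow(padded)
```

The padded rows contain `NA` in every column, which is what lets `cbind` line up two data frames of different lengths.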

If the dataframes have different numbers of rows, the excess rows will have names like NA, NA.1, NA.2, .... If you don't like that, you could use the following version of the function:

combine.df <- function(df1, df2) {
  n <- max(nrow(df1), nrow(df2))
  df <- cbind(df1[1:n, , drop = FALSE], df2[1:n, , drop = FALSE])
  row.names(df) <- 1:n  # replace NA, NA.1, ... with plain row numbers
  df
}
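A short usage sketch, repeating the function so the block runs on its own. The inputs `short` and `long` are made-up stand-ins for the nitrate and sulfate data frames. Both arguments must be supplied in the call.

```r
combine.df <- function(df1, df2) {
  n <- max(nrow(df1), nrow(df2))
  df <- cbind(df1[1:n, ], df2[1:n, ])
  row.names(df) <- 1:n
  df
}

# Hypothetical inputs: 3 nitrate rows, 5 sulfate rows
short <- data.frame(nitrate = c(0.651, 0.428, 1.04), ID = c(1L, 1L, 1L))
long  <- data.frame(sulfate = c(7.21, 5.99, 4.68, 3.47, 2.42),
                    ID1 = c(1L, 1L, 1L, 1L, 1L))

combined <- combine.df(short, long)
combined_dim <- dim(combined)   # 5 rows, 4 columns

# The padded nitrate rows are NA, so skip them when averaging
nitrate_mean <- mean(combined$nitrate, na.rm = TRUE)
```

The padding does not change the mean as long as `na.rm = TRUE` is used, since only the original values enter the calculation.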

John Coleman

The first and second functions return "error in nrow(b): argument "b" is missing, with no default." – Lime Jul 09 '19 at 10:34
@Emil With what input do the functions fail? They are designed to take as input 2 dataframes. That error message suggests that you tried to use the function with only a single input or perhaps no input at all. Please edit your question to provide a [mcve]. See [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/4996248) for what this would mean in R. It isn't that hard. We don't need all of your data so much as reproducible examples that faithfully reproduces its structure. E.g., pasting the output of `dput(head(df1,5))` and `dput(head(df2,7))` – John Coleman Jul 09 '19 at 11:34
Here's an example: ```dput(dim(two)) c(114349L, 2L) dput(dim(b)) c(118783L, 2L)``` Here is another: ```dput(head(two[1])) structure(list(nitrate = c(0.651, 0.428, 1.04, 0.363, 0.507, 0.474)), row.names = c(279L, 285L, 291L, 297L, 303L, 315L), class = "data.frame")``` and this one also: ```dput(head(b[1])) structure(list(sulfate = c(7.21, 5.99, 4.68, 3.47, 2.42, 1.43 )), row.names = c(279L, 285L, 291L, 297L, 303L, 315L), class = "data.frame")``` Thanks for teaching me how to reproduce small examples. – Lime Jul 10 '19 at 09:15
Would it be possible to merge the two by a row or a new column? for example, I can create a new column that extends as far as the longest row of the data.frame. And have both data frames merged using the 'by' function to this new column? I would probably then have to create this same column in each data.frame, it would be something simple with an integer sequence of n+1; I gave this a go and the merge conflicts with the data. the results of nitrate and sulfate have changed, why did this happen? I had supposed they would only merge by the 'id' column I had created, with the same data.frame – Lime Jul 10 '19 at 09:43
Here's the code I tried: ```library(tidyverse); two_df <- mutate(two, id = rownames(two)); b_df <- mutate(b, id = rownames(b)); first <- merge(two_df, b_df, on = id, all = TRUE)``` and this is an example of the output: ```dput(head(first, 1)) structure(list(ID = 1L, id = "1005", nitrate = 0.382, sulfate = 5.52), row.names = 1L, class = "data.frame")``` The first nitrate value should be '0.651' and the first sulfate value '7.21'; as you can see, they have changed. It seems the rows have shifted from their original positions and become disordered. – Lime Jul 10 '19 at 10:11
I found that using plyr, ```library(plyr); alpha <- join(two_df, b_df, by = "id", match = "all")```, solves it. What I want to understand is why the outputs of `join` and `merge` differ. – Lime Jul 10 '19 at 10:18
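One likely explanation for the difference, sketched with made-up data: `merge()` sorts its result on the key columns by default (`sort = TRUE`), and a character key such as `id` sorts as text, so "10" comes before "2". `plyr::join()`, by contrast, is documented to preserve the row order of its first argument. The sketch below uses only base R and puts the rows back in order by sorting on the numeric value of `id`.

```r
two_df <- data.frame(nitrate = c(0.651, 0.428, 1.04), id = c("1", "2", "10"),
                     stringsAsFactors = FALSE)
b_df   <- data.frame(sulfate = c(7.21, 5.99, 4.68), id = c("1", "2", "10"),
                     stringsAsFactors = FALSE)

# merge() sorts on the character key, so the rows come back in text order
merged <- merge(two_df, b_df, by = "id", all = TRUE)
merged_ids <- merged$id

# Re-order the merged rows by the numeric value of the key
ordered <- merged[order(as.integer(merged$id)), ]
```

The values themselves are unchanged by `merge()`; only the row order differs, which is why `head()` shows different first rows.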
