
I have two data frames, each with 1.2 million rows.

I tried merging them with `dummy <- merge(df1, df2)`. Neither data frame has a column in common, but both data frames' rows are already in the order I want.

I want them merged side-by-side, but when I run the merge function, I get this error:

    Error: cannot allocate vector of size 5905.6 Gb
In addition: Warning messages:
1: In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
  Reached total allocation of 8107Mb: see help(memory.size)
2: In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
  Reached total allocation of 8107Mb: see help(memory.size)
3: In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
  Reached total allocation of 8107Mb: see help(memory.size)
4: In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
  Reached total allocation of 8107Mb: see help(memory.size)
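A rough sanity check (toy arithmetic, not taken from the post) shows why the requested allocation is so large: with no shared column names, `merge(df1, df2)` uses `by = character(0)` and builds the full Cartesian product of the rows.

```r
# Back-of-the-envelope estimate of the cross join merge() attempts
# when the two data frames share no column names:
nx <- 1.2e6                 # rows in df1
ny <- 1.2e6                 # rows in df2
pairs <- nx * ny            # 1.44e12 row pairings
# even a single 4-byte integer index vector over those pairings is huge:
gib <- pairs * 4 / 2^30     # on the order of several thousand GiB
```

This is the same order of magnitude as the 5905.6 Gb allocation in the error message.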
Username
  • What are `object.size(df1)` and `object.size(df2)`? – MichaelChirico Mar 04 '16 at 02:34
  • `15113968 bytes` and `215909312 bytes` @MichaelChirico – Username Mar 04 '16 at 02:35
  • what kind of merge are you trying to do? http://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right – MichaelChirico Mar 04 '16 at 02:36
  • (1.2 million rows screams to me that you should be using `data.table`, by the way...) – MichaelChirico Mar 04 '16 at 02:36
  • @MichaelChirico I guess an outer join would be closest to what I want. Both tables have equal number of rows, I just want them side-by-side, if that makes sense. – Username Mar 04 '16 at 02:38
  • "Merged side by side": 5905.6 Gb. There's really no hope of making this work. I suspect you are merging on a small number of columns and that there are many duplicated values. For each combination of duplicated values in the two datasets there are N x M matches in a Cartesian join. Please rethink your approach. If by "side-by-side" you mean 'row-by-row', then you want `cbind`, NOT `merge`. The question should have been: what are `dim(df1)` and `dim(df2)`? – IRTFM Mar 04 '16 at 02:39
  • @42- with the caveat that "side-by-side" is typically meaningless if `df1` and `df2` aren't pre-ordered – MichaelChirico Mar 04 '16 at 02:43
  • @MichaelChirico What do you mean by "pre-ordered"? – Username Mar 04 '16 at 02:44
  • Does it make sense for the 1st row of `df1` to be associated with the first row of `df2` (and so on)? Why? Typically it's because the first rows share some characteristic (e.g., they're for the same year, or something like that). More robust is to identify that shared characteristic and _join according to that_ – MichaelChirico Mar 04 '16 at 02:45
  • @MichaelChirico Yes, it makes sense for first row of `df1` to be associated with first row of `df2`. In any case, `cbind` was what I needed – Username Mar 04 '16 at 02:47
  • Michael is trying to ascertain if there is a unique identifier that allows a correct 1-to-1 match between rows of the two datasets. It's possible that the row number is such an identifier, but since you have not told us very much about how these two objects were created, we are still guessing. – IRTFM Mar 04 '16 at 02:48
  • Still feel obligated to say that _`cbind` is in general a terrible way to join **data**_. `cbind` is designed for matrices. You should be using the `by` option in `merge`, or better yet the `on` option in `data.table`. You're lucky that your data is properly ordered, but this is often not the case (and order can change perhaps unexpectedly, to the untrained eye) – MichaelChirico Mar 04 '16 at 02:48
  • To 42 and Michael: All I can say about the data is they definitely line up such that `cbind` worked. Neither table has a column in common. They're just in the same order. I usually use `merge`, but I had no control over how the data was given to me. – Username Mar 04 '16 at 02:51
  • If there's no common column, and there's 1.2 _million_ rows, how can you _know_ they're in the same order? Seems fundamentally risky. Just want to throw several red flags that _using `cbind` for a merge should be avoided if at all possible_. – MichaelChirico Mar 04 '16 at 02:52
  • There are cbind methods for both matrices and data.frames. There is no particular danger in using `cbind` with dataframes that is analogous to the dangers of using `apply` on a dataframe. No coercion to 'matrix'-class occurs. There is also `merge( , , by="row.names")` for the situation where one can be assured of "parallel data". – IRTFM Mar 04 '16 at 02:53
  • @MichaelChirico You're absolutely correct. Anyone else reading this *must* take caution. In my case, I know *for sure* the rows from both datasets line up. I wish I could say more about how I know, but I cannot. – Username Mar 04 '16 at 02:54
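The row-by-row combination the comments converge on can be sketched on toy data (the column names `a` and `b` are hypothetical, not from the post):

```r
# two data frames with no columns in common, already in matching row order
df1 <- data.frame(a = 1:3)
df2 <- data.frame(b = c("x", "y", "z"))

# cbind pairs row i of df1 with row i of df2 -- no key matching, no blow-up
dummy <- cbind(df1, df2)
nrow(dummy)  # 3 rows, not 3 * 3
```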

1 Answer


You might want to try `data.table`'s `cbind` method as an alternative to merging on row names; the `setDT()` conversions themselves are done by reference, without copying the inputs.

library("data.table")
setDT(df1)  # convert in place, by reference (no copy)
setDT(df2)
data.table::cbind(df1, df2)  # column-bind the two data.tables side by side

But note this warning in the package reference:

These functions are masked in data.table because of this feature in cbind :

“The data frame method will be used if at least one argument is a data frame”. This means that cbind(DT,DF) dispatches to the S3 method cbind.data.frame even when cbind.data.table is provided by data.table. Therefore, we have masked these functions. Warning messages will be presented to the user (by R) when the data.table package is loaded. If the first argument is not a data.table, then they revert to the base versions.
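As IRTFM notes in the comments, base `merge` can also express the same 1-to-1 pairing explicitly via row names. A sketch with hypothetical toy frames; note that `merge` returns rows sorted by the character `Row.names` column, so the original order has to be restored:

```r
df1 <- data.frame(a = 1:3)
df2 <- data.frame(b = c("x", "y", "z"))

m <- merge(df1, df2, by = "row.names")    # adds a Row.names key column
m <- m[order(as.integer(m$Row.names)), ]  # restore numeric row order
```

Unlike plain `cbind`, this keeps the pairing explicit, at the cost of a copy and the extra `Row.names` column.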

C8H10N4O2
  • Depending on the structure and density of your data, you might also look at converting these data frames to sparse matrices and using something like [cBind](http://www.inside-r.org/r-doc/Matrix/cBind) – C8H10N4O2 Mar 04 '16 at 13:02