
So my problem may be naïve, but I've been searching for a long while and I still can’t find the answer. I have two large data sets:

  1. One is a census file with more than 700,000 records.

    Lastname Census     1stname Census     census_year                
    C2last                C2first            1880
    C3last                C3first            1850
    C4last                C4first            1850
    
  2. The other one is a sample of civil registers composed of 80,000 observations.

    Lastname Reg      1stname reg       birth_year               
    P2Last              P2first         1818
    P3last              P3first         1879
    P4last              P4first         1903
    

I need to carry out the Cartesian product of both data sets, which is obviously huge (700,000 × 80,000 rows): for each row of the census, all 80,000 civil register rows should be added, together with an extra variable.

The values of this extra variable depend on a condition: the census year (a variable of the census) must be larger than the 'year of birth' variable of the civil registers (in other words, the person in the register was already born when the census was taken).

As I said, the goal is to build the Cartesian product, adding an extra variable (flag) that takes the value '1' when the condition is fulfilled (census year > birth year) and '0' when it's not:

LastNCens  1stNCens   cens_year  LastNamReg   1stNamReg       birth      Flag
C2last     C2first         1880      P2Last     P2first        1818         1
                                     P3last     P3first        1879         1
                                     P4last     P4first        1903         0
C3last     C3first         1850      P2Last     P2first        1818         1
                                     P3last     P3first        1879         0
                                     P4last     P4first        1903         0
C4last     C4first         1850      P2Last     P2first        1818         1
                                     P3last     P3first        1879         0
                                     P4last     P4first        1903         0
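
On the three-row samples above, this flagged product can be built directly in base R; a toy sketch only (the column names are shortened to syntactically valid R placeholders of my own):

cens <- data.frame(last_cens  = c("C2last", "C3last", "C4last"),
                   first_cens = c("C2first", "C3first", "C4first"),
                   cens_year  = c(1880, 1850, 1850))
reg  <- data.frame(last_reg   = c("P2Last", "P3last", "P4last"),
                   first_reg  = c("P2first", "P3first", "P4first"),
                   birth_year = c(1818, 1879, 1903))

toy <- merge(cens, reg, by = NULL)                       # cross join: 3 x 3 = 9 rows
toy$Flag <- as.integer(toy$cens_year > toy$birth_year)   # 1 when the person was born before the census year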

All this, keeping in mind that the full product is far too large to hold in memory (700,000 × 80,000 = 56 billion rows).

I have tried many things (compare, diff, intersect) and I've also read about other approaches that I couldn't apply here (pandas' df.where, pd.merge), but they don't do what I need. My simple approach would have been:

cp <- merge(census, register, by = NULL)   # full Cartesian product
final.dataframe <- cp[cp$census_year > cp$birth_year, ]

But R runs out of memory.

It goes without saying that the resulting data frame (the Cartesian product) would also be valid with only those records that are flagged as '1' (getting rid of those with Flag='0').

I hope this is well explained and also useful for other people… Thanks a million for any tip. It's very welcome.

  • Can you post the exact error you receive with your attempted `merge` call? – Parfait Feb 12 '19 at 18:00
  • Hello, and thank you for taking the time to have a look at it. Answering your question, I don't get any error messages. It's just that my PC and laptop, both at work, are not powerful enough to do that simple (but very heavy) merge. They run out of memory, and I can't find a workaround. I think one could be to split both large data sets into tens of smaller blocks and then index them; once I had these blocks, the idea would be to run them in a loop, merging them by their indexes, but I don't know how to index the blocks either (see the sketch after these comments). – Paco Feb 13 '19 at 12:05
  • See this [SO thread](https://stackoverflow.com/q/10600060/1422451). Consider `data.table` for the cross join (2nd answer). – Parfait Feb 13 '19 at 21:30
  • Thank you @Parfait for your time... I've been having a look at your suggestion, but I think the problem is still the same (unless I'm not applying data.table correctly). I'm far from being an expert, so I may be making many mistakes, but the error I now get points to the sheer number of rows involved ("Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in more than 2^31 rows (internal vecseq reached physical limit)..."). – Paco Feb 15 '19 at 16:19
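
The blocking idea mentioned in the comments can be sketched in base R. This is an illustration only: it assumes data frames named `census` and `register` with the columns `census_year` and `birth_year` from the question, the block size must be tuned to the available memory, and even the filtered (Flag = 1) result can still be very large.

# Split the census into blocks, cross-join one block at a time,
# and keep only the rows that satisfy the condition.
block_size <- 1000
blocks <- split(census, ceiling(seq_len(nrow(census)) / block_size))

keep_list <- lapply(blocks, function(b) {
  cp <- merge(b, register, by = NULL)      # cross join of one block only
  cp[cp$census_year > cp$birth_year, ]     # keep just the Flag = 1 rows
})

final.flag1 <- do.call(rbind, keep_list)

Each intermediate product is only block_size × nrow(register) rows, so memory stays bounded; writing each filtered block to disk instead of rbind-ing everything at the end would bound it even further.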

1 Answer


Going along with the comments to the question, one could achieve what you're looking for using the data.table package. The package modifies data by reference and as such can help reduce the amount of memory used for subsets, merges and calculations. For more information about the package, I suggest its GitHub wiki, which contains a quick cheat sheet for most computations.

Below is an example of how one could perform the kind of merge you are looking for using data.table. It is referred to as a non-equi join. A few notes: it seems that a bug is present in the data.table package which has not yet been reported. `by = .EACHI` seems necessary when you output both of the joined columns, in order to obtain the original values from the left side of the join. However, it is a small cost.

df1 <- fread("Lastname_Census     firstname_Census     census_year                
              C2last                C2first            1880
              C3last                C3first            1850
              C4last                C4first            1850", key = "census_year")
df2 <- fread("Lastname_Reg      firstname_reg       birth_year               
              P2Last              P2first         1818
              P3last              P3first         1879
              P4last              P4first         1903", key = "birth_year")

cart_join <- 
    df2[df1, # join df1 on df2
      on = .(birth_year >= census_year), # join criteria
      # list the columns to keep explicitly (the i. prefix selects columns from df1)
      j = .(i.Lastname_Census,
            i.firstname_Census, 
            Lastname_Reg, 
            firstname_reg, 
            birth_year, 
            i.census_year, 
            Flag = birth_year >= i.census_year), 
      # force evaluation for each row of i; this keeps the correct birth_year (seems to be a bug otherwise)
      by = .EACHI,
      # let the result grow beyond nrow(df1) + nrow(df2)
      allow.cartesian = TRUE][, -1] # drop the first column (the join column)
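
If, as the question allows, only the flagged pairs are needed, the result can be reduced afterwards with plain data.table subsetting (assuming `cart_join` came out as intended; this is just a usage note, not part of the join itself):

cart_join[, Flag := as.integer(Flag)]   # recode TRUE/FALSE as 1/0, as in the question
flag1_only <- cart_join[Flag == 1L]     # keep only the flagged rows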

Edit (A few possible bugs)

After playing around with the join, I noticed a few irregularities and opened an issue here. Note that you should be careful with my suggested answer above. It seems to work fine while returning values from both tables (other than the ones used in the `on` statement), but it is not bulletproof. Please refer to my open issue for more information.

  • Fyi, there is also `x.birth_year`, though I'm not sure if that addresses the issue in your comment about needing by=.EACHI. (I also commented on github, but figure it's simpler to make minor comments here in case there's back and forth.) – Frank Feb 28 '19 at 17:36
  • Hello @Frank, thank you. I actually had not considered `x.var` as an alternative. It does address (2) and (3) in my post on GitHub, but even without these, the behaviour still seems inconsistent with what I expected. Thanks as well for linking me to the related issue; I thought I had looked through them all for similar problems, without luck. – Oliver Feb 28 '19 at 17:52