So my problem may be naïve, but I've been searching for a long while and I still can’t find the answer. I have two large data sets:
One is a census file with more than 700,000 records.
Lastname Census 1stname Census census_year C2last C2first 1880 C3last C3first 1850 C4last C4first 1850
The other one is a sample of civil registers composed of 80,000 observations.
Lastname Reg 1stname reg birth_year P2Last P2first 1818 P3last P3first 1879 P4last P4first 1903
I need to carry out the Cartesian product of both data sets, which is obviously a huge file (700,000 x 80,000), where for each row of the census we should be adding the 80,000 civil registers with an extra variable.
The values for this extra variable fulfill a condition. The condition is that the census year (a variable of the census) is larger than the variable 'year of birth' of the civil registers (or, in other words, the census year is younger than the birth on the register).
As I said, the goal is to make the Cartesian product, but adding an extra variable (flag) that gives a '1' when the condition is fulfilled (census year > birth year) or '0', when it's not:
LastNCens 1stNCens cens_year LastNamReg 1stNamReg birth Flag
C2last C2first 1880 P2Last P2first 1818 1
P3last P3first 1879 1
P4last P4first 1903 0
C3last C3first 1850 P2Last P2first 1818 1
P3last P3first 1879 0
P4last P4first 1903 0
C4last C4first 1860 P2Last P2first 1818 1
P3last P3first 1879 0
P4last P4first 1903 0
All this, keeping in mind that the product is too big.
I have tried many things (compare, diff, intersect) and I've read also other things that I couldn't apply (df.where
, pd.merge
), but they don't do what I need and I can't use them here. My simple approach would have been:
cp <- merge(census, register,by=NULL);
final.dataframe <- cp [which (cp$census_year > cp$birth_year_hsn ),]
But R runs out of memory.
It goes without saying that the resulting data frame (the Cartesian product) would also be valid with only those records that are flagged as '1' (getting rid of those with Flag='0'
).
I hope this is well explained and also useful for other people… Thanks a million for any tip. It's very welcome.