
This is a follow-up to my earlier question about how to locate and compare a particular attribute between two text files.

Thanks to the accepted answer, I put together the code below, which works (on my system) on smaller data sets:

library(tidyverse)
library(data.table)

con1  <- file("file1.csv", open = "r")
con2  <- file("file2.csv", open = "r")

file1 <- select(read.csv(con1, sep = "|", fill = F, colClasses = "character"),
                PROFILE.ID, USERID)

setDT(file1)

file2 <- select(read.csv(con2, sep = "|", fill = F, colClasses = "character"),
                PROFILE.ID, USERID)

setDT(file2)

full_join(file1, file2, by = "USERID") %>%
  filter(is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) |
                PROFILE.ID.x != PROFILE.ID.y)

close(con1)
close(con2)

The problem: When R starts processing the full_join call, it eventually stops with the error "cannot allocate vector of size 557.6 Mb".

The environment: This is 64-bit R v3.6.2 on Windows 10, and memory.limit() returns 16222 (Mb, i.e. roughly 16 GB). I have no other objects loaded in R besides what the code above creates.

Probable cause: The problem probably stems from the two CSV files having about 120K rows and 83 columns each.

What I have tried so far but without resolving the issue:

  • Used select() to drop the unnecessary columns.
  • Used data.table and setDT() to convert the data frames to data tables.
  • Closed all apps with a visible UI (Outlook, Google Chrome, Excel, etc.).

Regardless of what I try, the error always refers to "557.6 Mb". I cannot add more RAM to this machine at the moment, as it is a company laptop.

Question: Is there a way to load the files in chunks, or some other way of rewriting the code, to get around the error?

  • Check how many overlapping USERIDs there are. What happens if you join the head(10000) of both tables? – mr.joshuagordon Feb 03 '20 at 02:23
  • thank you @mr.joshuagordon! I did that and did get some output without the process crashing. The output seemed to indicate that the process has an issue because some of the rows have a blank USERID. So I then added a line to first filter() the rows to only those where USERID is not blank, and it now seems to be working. That was a good tip so thank you for that. – user100487 Feb 03 '20 at 04:18
  • If you're using `data.table`, then you should use the DT-friendly `merge` instead of `dplyr`'s join functions. It's not that the join functions are bad, but `data.table` tries very hard to be memory-efficient, and this includes non-deep-copy merges. Take a quick look at `?merge.data.table`. Long story short: `merge(file1, file2, by = "USERID", all = TRUE)[is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) | PROFILE.ID.x != PROFILE.ID.y,]`. If that doesn't work, then my next suggestion would be to move this to a database (or similar non-local, non-R solution). – r2evans Feb 03 '20 at 04:30
  • good to know! thank you @r2evans. – user100487 Feb 03 '20 at 04:32
  • (My answer in your previous question leaned towards `dplyr` because I saw `tidyverse` and I don't recall seeing a "large data" hint. While `dplyr` has many advantages, memory is not one of them; I think `data.table` wins that debate many times over.) – r2evans Feb 03 '20 at 04:34
  • Note you are reading in all 83 columns when you only require 2. Although it doesn't seem to be a bottleneck for this problem, you'd be well advised to only read in the columns you need: `file1 <- fread("file1.csv", select = c("PROFILE.ID", "USERID"))` – Hugh Feb 06 '20 at 15:34
  • good point. Thank you @Hugh – user100487 Feb 06 '20 at 15:37

1 Answer


My first solution here was to remove the rows from file1 and file2 where the USERID is blank.
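The blank IDs matter because every blank USERID in file1 matches every blank USERID in file2, so the join materializes the cross product of those rows. That also fits the error message: if the failed allocation was an 8-byte numeric vector, 557.6 Mb is roughly 73 million elements, far larger than either 120K-row input. A minimal sketch with toy data (not the actual files) showing the blow-up:

library(dplyr)

# Each toy table has one real ID and three rows with a blank USERID
a <- data.frame(USERID = c("u1", "", "", ""), PROFILE.ID = as.character(1:4),
                stringsAsFactors = FALSE)
b <- data.frame(USERID = c("u1", "", "", ""), PROFILE.ID = as.character(5:8),
                stringsAsFactors = FALSE)

# The three blank rows on each side match pairwise: 1 + 3 * 3 = 10 rows,
# not the 4 you might expect
nrow(full_join(a, b, by = "USERID"))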

As per the comment made by @r2evans, this was not an ideal solution on its own, since data.table's merge is more memory-efficient than the join functions in dplyr.

So a more effective solution would probably have been to replace the full_join with a merge, like so:

merge(file1, file2, by = "USERID", all = TRUE)[
  is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) | PROFILE.ID.x != PROFILE.ID.y, ]
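Note that merge.data.table uses the same default ".x"/".y" suffixes for the duplicated PROFILE.ID columns, so the filter condition carries over from the dplyr version unchanged.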

Still, the first fix that corrected the issue here was to add the lines file1 <- filter(file1, USERID != "") and file2 <- filter(file2, USERID != "") to the code.

In the end, the code looked like this:

library(tidyverse)
library(data.table)

con1  <- file("file1.csv", open = "r")
con2  <- file("file2.csv", open = "r")

# Read the pipe-delimited file, keeping only the two relevant columns
file1 <- select(read.csv(con1, sep = "|", fill = F, colClasses = "character"),
                PROFILE.ID, USERID)

# Drop rows with a blank USERID, then convert to a data.table by reference
file1 <- filter(file1, USERID != "")
setDT(file1)

# Same treatment for the second file
file2 <- select(read.csv(con2, sep = "|", fill = F, colClasses = "character"),
                PROFILE.ID, USERID)

file2 <- filter(file2, USERID != "")
setDT(file2)

# Full outer join, keeping USERIDs whose PROFILE.ID is missing on
# either side or differs between the two files
full_join(file1, file2, by = "USERID") %>%
  filter(is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) |
         PROFILE.ID.x != PROFILE.ID.y)

close(con1)
close(con2)
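For reference, combining the two tips from the comments (fread with select =, per @Hugh, and merge.data.table, per @r2evans) should give a leaner version that never reads the 81 unused columns into memory. A sketch along those lines, which I have not benchmarked on the original files:

library(data.table)

# Read only the two needed columns from the pipe-delimited files
file1 <- fread("file1.csv", sep = "|", select = c("PROFILE.ID", "USERID"),
               colClasses = "character")
file2 <- fread("file2.csv", sep = "|", select = c("PROFILE.ID", "USERID"),
               colClasses = "character")

# Drop blank USERIDs so the full join cannot cross-multiply them
file1 <- file1[USERID != ""]
file2 <- file2[USERID != ""]

# Full outer join, keeping rows where PROFILE.ID is missing or differs
merge(file1, file2, by = "USERID", all = TRUE)[
  is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) | PROFILE.ID.x != PROFILE.ID.y, ]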