R: How to locate and compare a particular element between two CSV text files

Question

I found some similar questions such as this one (about comparing attributes in XML files), this one (about a case where the compared values are numeric) and this one (about getting a number of columns that differ between two files) but nothing about this particular problem.

I have two CSV text files in which many, but not all, rows are equal. The files have the same number of columns, with the same data types in each column, but they do not have the same number of rows. Each file has around 120K rows, and each file has some rows that are not in the other.

Simplified versions of these files would look as shown below.

File 1:

PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451

File 2:

PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012

In the actual files, many of the USERID values in the first file can also be found in the second file (though some may not be present). Also, while the USERID values are unchanged for all users, their PROFILE.ID may have changed.

The problem is that I need to locate the rows where the PROFILE.ID has changed.

I am thinking that I would have to use the following sequence of steps to analyze it in R:

Load both files to R Studio as data frames
Loop through the USERID column on the first file (which has more rows)
Search the second file for each USERID found in the first file
Return the corresponding PROFILE.ID from second file
Compare the returned value with what is in the first file
Output the rows where the PROFILE.ID values differ

I was thinking of writing something like the code shown below but am not sure if there are better ways to accomplish this.

library(tidyverse)

con1  <- file("file1.csv", open = "r")
con2  <- file("file2.csv", open = "r")

file1 <- read.csv(con1, fill = F, colClasses = "character")
file2 <- read.csv(con2, fill = F, colClasses = "character")

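# Start with an empty data frame that has the same columns as file2
output <- file2[0, ]

for (i in seq_len(nrow(file1))) {
   profIDFile1 <- file1$PROFILE.ID[i]
   userIDFile1 <- file1$USERID[i]

   # Rows of file2 with the same USERID (zero rows if the user is absent)
   profIDRowFile2 <- filter(file2, USERID == userIDFile1)
   profIDFile2 <- profIDRowFile2$PROFILE.ID

   # Collect the file2 row(s) whose PROFILE.ID differs from the one in file1
   if (length(profIDFile2) > 0 && any(profIDFile1 != profIDFile2)) {
     output <- rbind(output, profIDRowFile2[profIDFile2 != profIDFile1, ])
   }

}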

write.csv(output, file='result.csv', row.names=FALSE, quote=FALSE)

close(con1)
close(con2)

Question: Is there a package in R that can do this kind of comparison or what would be a good way to accomplish this in R script?

score 3 · Accepted Answer · answered Feb 01 '20 at 23:03

I think you can do this with a simple join:

library(dplyr)
full_join(file1, file2, by = "USERID") %>%
  filter(PROFILE.ID.x != PROFILE.ID.y)
#   PROFILE.ID.x    CITY.x   STATE.x USERID PROFILE.ID.y    CITY.y   STATE.y
# 1         4350 Nashville Tennessee GW7420         1173 Nashville Tennessee
# 2         5486   Durango  Colorado BH9012         T216   Durango  Colorado
# 3         R719 Flagstaff   Arizona YT7460         R540 Flagstaff   Arizona

This shows that those three USERID rows have differing PROFILE.ID fields. (The .x columns are from file1, the .y columns from file2.)
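If the .x/.y suffixes are hard to read, the `suffix` argument of dplyr's join functions can rename them. The sketch below rebuilds the sample data from the question inline (an assumption made so the snippet runs on its own; the real files would be read from disk), and the suffix names `.file1` and `.file2` are just illustrative choices.

```r
library(dplyr)

# Sample data copied from the question; read as character so IDs such as
# "R719" are not coerced to another type
file1 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451", colClasses = "character")

file2 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012", colClasses = "character")

# Same join as above, but with self-describing suffixes on the shared columns
changed <- full_join(file1, file2, by = "USERID",
                     suffix = c(".file1", ".file2")) %>%
  filter(PROFILE.ID.file1 != PROFILE.ID.file2)

changed
```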

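That test silently drops any USERID that is missing from one of the files, because comparing against NA yields NA and filter() discards those rows. If you want to see them, you might add logic such as: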

full_join(file1, file2, by = "USERID") %>%
  filter(is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) |
           PROFILE.ID.x != PROFILE.ID.y)
#   PROFILE.ID.x    CITY.x   STATE.x USERID PROFILE.ID.y    CITY.y   STATE.y
# 1         4350 Nashville Tennessee GW7420         1173 Nashville Tennessee
# 2         5486   Durango  Colorado BH9012         T216   Durango  Colorado
# 3         R719 Flagstaff   Arizona YT7460         R540 Flagstaff   Arizona
# 4         Z551 Flagstaff   Arizona ML1451         <NA>      <NA>      <NA>

The fourth row indicates a USERID that is missing in file2. This is likely an artifact of the small sample dataset (which is good on SO :-), so I'm not certain whether it is interesting or meaningful to you.
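If the missing users do matter, one option is to list them separately with `anti_join()`, which keeps the rows of its first argument that have no match in the second. This is only a sketch on a trimmed copy of the sample data (USERID and PROFILE.ID columns only, an assumption made to keep it short); the names `only_in_file1` and `only_in_file2` are illustrative.

```r
library(dplyr)

# Trimmed copy of the sample data from the question
file1 <- data.frame(
  PROFILE.ID = c("2265", "4350", "5486", "R719", "Z551"),
  USERID     = c("EL4950", "GW7420", "BH9012", "YT7460", "ML1451"),
  stringsAsFactors = FALSE
)
file2 <- data.frame(
  PROFILE.ID = c("1173", "2265", "R540", "T216"),
  USERID     = c("GW7420", "EL4950", "YT7460", "BH9012"),
  stringsAsFactors = FALSE
)

# Rows whose USERID appears in one file but not in the other
only_in_file1 <- anti_join(file1, file2, by = "USERID")
only_in_file2 <- anti_join(file2, file1, by = "USERID")

only_in_file1
only_in_file2
```

With the sample data, `only_in_file1` holds the ML1451 row and `only_in_file2` is empty.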

That works! Thank you. I do get a "cannot allocate vector of size 557.6 Mb" error (probably because of the size of the files) but I opened a separate question for that. - user100487, Feb 03 '20 at 01:54
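Regarding the memory error in the comment above: a possible lighter-weight route is a vectorized lookup with base R's `match()`, which avoids building the full joined table. This is a sketch, not something tested on the 120K-row files. It assumes each USERID appears at most once in file2 (`match()` only returns the first hit), and it uses a trimmed copy of the sample data.

```r
# Trimmed copy of the sample data from the question
file1 <- data.frame(
  PROFILE.ID = c("2265", "4350", "5486", "R719", "Z551"),
  USERID     = c("EL4950", "GW7420", "BH9012", "YT7460", "ML1451"),
  stringsAsFactors = FALSE
)
file2 <- data.frame(
  PROFILE.ID = c("1173", "2265", "R540", "T216"),
  USERID     = c("GW7420", "EL4950", "YT7460", "BH9012"),
  stringsAsFactors = FALSE
)

# Assumption: USERID is unique within file2; stop early if it is not
stopifnot(anyDuplicated(file2$USERID) == 0)

# Position of each file1 USERID in file2 (NA when the user is absent)
idx <- match(file1$USERID, file2$USERID)
new_id <- file2$PROFILE.ID[idx]

# TRUE only where the user exists in both files and the PROFILE.ID differs
changed <- !is.na(idx) & file1$PROFILE.ID != new_id

result <- data.frame(
  USERID           = file1$USERID[changed],
  PROFILE.ID.file1 = file1$PROFILE.ID[changed],
  PROFILE.ID.file2 = new_id[changed],
  stringsAsFactors = FALSE
)
result
```

If USERID turns out to be duplicated in either file, that would also be worth checking before joining, since repeated keys multiply the number of rows a join produces.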

score 0 · Answer 2 · answered Feb 01 '20 at 23:51


We can do this with base R

subset(merge(file1, file2, by = 'USERID'), PROFILE.ID.x != PROFILE.ID.y)
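Note that `merge()` does an inner join by default, so users present in only one file never appear. To also see those, a sketch with `all = TRUE` (the base R counterpart of a full join) could look like this; it uses a trimmed copy of the sample data, and the name `joined` is illustrative.

```r
# Trimmed copy of the sample data from the question
file1 <- data.frame(
  PROFILE.ID = c("2265", "4350", "5486", "R719", "Z551"),
  USERID     = c("EL4950", "GW7420", "BH9012", "YT7460", "ML1451"),
  stringsAsFactors = FALSE
)
file2 <- data.frame(
  PROFILE.ID = c("1173", "2265", "R540", "T216"),
  USERID     = c("GW7420", "EL4950", "YT7460", "BH9012"),
  stringsAsFactors = FALSE
)

# all = TRUE keeps unmatched rows from both sides, filled with NA
joined <- merge(file1, file2, by = "USERID", all = TRUE)

# Keep rows where the PROFILE.ID changed or is missing on either side
diffs <- subset(joined,
                is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) |
                  PROFILE.ID.x != PROFILE.ID.y)
diffs
```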


akrun
