I found some similar questions such as this one (about comparing attributes in XML files), this one (about a case where the compared values are numeric) and this one (about getting a number of columns that differ between two files) but nothing about this particular problem.
I have two CSV text files on which many, but not all, rows are equal. The files have the same amount of columns with same data type on the columns but they do not have the same amount of rows. The amount of rows on both files is around 120K and both files have some rows that are not on the other.
Simplified versions of these files would look as shown below.
File 1:
PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451
File 2:
PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012
In the actual files many of the USERID
values in the first file can also be found in the second file (some may not be present however). Also while the USERID
values are unchanged for all users, their PROFILE.ID
may have changed.
The problem is that I would have to locate the rows where the PROFILE.ID
has changed.
I am thinking that I would have to use the following sequence of steps to analyze it in R:
- Load both files to R Studio as data frames
- Loop through the
USERID
column on the first file (which has more rows) - Search the second file for each
USERID
found in the first file - Return the corresponding
PROFILE.ID
from second file - Compare the returned value with what is in the first file
- Output the rows where the
PROFILE.ID
values differ
I was thinking of writing something like the code shown below but am not sure if there are better ways to accomplish this.
library(tidyverse)
con1 <- file("file1.csv", open = "r")
con2 <- file("file2.csv", open = "r")
file1 <- read.csv(con1, fill = F, colClasses = "character")
file2 <- read.csv(con2, fill = F, colClasses = "character")
for (i in seq(nrow(file1))) {
profIDFile1 <- file1$PROFILE.ID[i]
userIDFile1 <- file1$USERID[i]
profIDRowFile2 <- filter(file2, USERID == userIDFile1)
profIDFile2 <- profIDRowFile2$PROFILE.ID
if (profIDFile1 != profIDFile2) {
output < - profIDRowFile2
}
}
write.csv(output, file='result.csv', row.names=FALSE, quote=FALSE)
close(con1)
close(con2)
Question: Is there a package in R that can do this kind of comparison or what would be a good way to accomplish this in R script?