0

I am comparing two csv files using R/Rstudio and I would like to compare them line by line, but in a specific order based on their columns. If my data looks like:

first <-read.csv(text="
name,    number,    description,    version,    manufacturer
A123,    12345,     first piece,    1.0,        fakemanufacturer
B107,    00001,     second,         1.0,        abcde parts
C203,    20000,     third,          NA,         efgh parts
D123,    12000,     another,        2.0,        NA")

second csv:

second <- read.csv(text="
name,    number,    description,    version,    manufacturer
A123,    12345,     first piece,    1.0,        fakemanufacturer
B107,    00001,     second,         1.0,        abcde parts
C203,    20000,     third,          NA,         efgh parts
E456,    45678,     third,          2.0,       ")

I'd like to have a for loop that looks something like:

for line in csv1:
    if number exists in csv2:
        if csv1$name == csv2$name:
            if csv1$description == csv$description:
                if csv1$manufacturer == csv2$manufacturer:
                    break
                else:
                    add line to csv called changed, append a value for "changed" column to manufacturer
            else:
                add line to csv called changed, append a value for "changed" column to description

and so on so that the output then looks like:

name    number    description    version    manufacturer        changed
A123    12345     first piece    1.0        fakemanufacturer    number
B107    00001     second         1.0        abcde parts         no change
C204    20000     third                     newmanufacturer     number, manufacturer     
D123    12000     another        2.0                            removed
E456    45678     third          2.0                            added

and if at any point in this loop something doesn't match, I'd like to know where the mismatch was. The lines can match by number OR description. for example, given the 2 lines above, I would be able to tell that number changed between the two csv files. Thanks in advance for any help!!

MrFlick
  • 195,160
  • 17
  • 277
  • 295
pbthehuman
  • 123
  • 3
  • 12
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. How do you want to keep track of the mismatches exactly? What's supposed to come after those "if" statements? – MrFlick Feb 20 '20 at 21:16
  • @MrFlick I've edited for clarity, thank you! – pbthehuman Feb 20 '20 at 21:45
  • Not exactly what you asked for, but you might find the package `dataCompareR` useful. –  Feb 20 '20 at 23:43

1 Answers1

0

It should be something like this, but as you have provided no data to test it I cannot vouch for my code:


cmpDF <- function(DF1, DF2){
    DF2 <- DF2[DF2$number %in% DF1$number,] #keep only the rows of DF2 that are
                                             #also in DF1
    retChar <- character(nrow(DF1)) 
    names(retChar) <- DF1$number #call the retChar vector with the number
                                 # to be able to update it later

    DF1 <- DF1[DF1$number %in% DF2$number,]#keep only the rows of DF1 that are
                                             #also in DF2


    # sort rows to make sure that equal rows have the same row number:
    DF1 <- DF1[order(DF1$number),] 
    DF2 <- DF2[order(DF2$number),]

    equals <- DF1 == DF2
    identical <- rowSums(DF1 == DF2) == ncol(DF1) #here all elements are the same
    retChar[as.character(DF1$number[identical])] <- "no change"

    for(i in 1:ncol(DF1)){
        if(colnames(DF1)[i] == "number") next

        different <- !equals[,i]
        retChar[as.character(DF1$number[different])] <- ifelse(nchar(retChar[as.character(DF1$number[different])]),
                                                               paste0(retChar[as.character(DF1$number[different])], colnames(DF1)[i], sep = ", "),
                                                               colnames(DF1)[i])

    }

    retChar[nchar(retChar) == 0] <- "number not in DF2" 
    return(retChar)

}

Grada Gukovic
  • 1,228
  • 7
  • 13
  • Thank you! my test data looks exactly like the csv1 and csv2 i havein the post, with the first row being my column headers. There is more data in them, but too much to list for just one post! I think your solution is definitely on the right track, but the two dataframes may not always be the same size. Thank you again for your help, I really appreciate it! – pbthehuman Feb 21 '20 at 13:30