
I hope you are doing well. I'm processing unstructured data for a CHR master dataset:

originaldata <- read.csv('./csv/Info ECE 2014 - 2021.csv', header = TRUE, na.strings = "")

After cleaning and structuring the data, I do the text processing, which produces two different datasets from this file:

dataset1 <- data.frame(id = originaldata$id)
# Do the text processing here and add the results to dataset1

dataset2 <- data.frame(id = originaldata$id)
# Do the text processing here and add the results to dataset2

newdata <- merge(dataset1, dataset2, by = "id")

The problem is that when I merge dataset1 and dataset2 (both have the same number of rows, i.e. 10,692 obs., the same as the original data), newdata has 11,392 obs. (700 additional rows), and I cannot figure out why, considering that both id columns come from the same source. Any help will be truly appreciated.

I'm using `merge()` from base R.

nhedzll
  • You probably have duplicate IDs. A merge doesn't know which value goes where in such a case so it creates a row for all possible combinations of the duplicated value. In the future, it's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Nov 07 '22 at 17:36
  • @MrFlick thanks for your kind response. Data looks like this: `originaldata <- data.frame(id = c("9950 004733", "9671 013717", "9116 0000611"), sex = c("MALE", "MALE", "FEMALE"), age = c(22, 37, 45))` dataset1 looks like `dataset1 <- data.frame(id = c("9950 004733", "9671 013717", "9116 0000611"), var_x1 = c(TRUE, TRUE, FALSE), var_x2 = c(NA, NA, TRUE))` dataset2 looks like `dataset2 <- data.frame(id = c("9950 004733", "9671 013717", "9116 0000611"),var_y1 = c(FALSE, NA, FALSE), var_y2 = c(FALSE, NA, NA))` and newdata is the merge of dataset1 and 2 (id, var_x1, var_x2, var_y1, var_y2) – nhedzll Nov 07 '22 at 17:55
  • It's better to edit your question with additional details than to try to put them in comments since it's much harder to format them there. I can't reproduce the problem with your example. All datasets have three rows and the merge has three rows. If it doesn't replicate the issue, it's not a helpful example. As I pointed out, if you duplicate an ID value in originaldata, you'll see that the number of rows in the merge will increase. – MrFlick Nov 07 '22 at 18:02
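
The duplicate-ID explanation in the comments can be checked with a small sketch. The data below is made up for illustration (the IDs "A" and "B" and the toy values are assumptions, not the asker's real data); the point is only that `merge()` returns one row per combination of matching rows, so a repeated ID inflates the result.

```r
# Toy data (hypothetical): the id "A" appears twice in both data frames.
dataset1 <- data.frame(id = c("A", "A", "B"), var_x1 = c(TRUE, FALSE, TRUE))
dataset2 <- data.frame(id = c("A", "A", "B"), var_y1 = c(NA, TRUE, FALSE))

# merge() pairs every "A" row of dataset1 with every "A" row of dataset2.
newdata <- merge(dataset1, dataset2, by = "id")
nrow(newdata)  # 5: 2 * 2 rows for "A" plus 1 row for "B"

# List the ids that occur more than once.
dup_ids <- unique(dataset1$id[duplicated(dataset1$id)])
dup_ids

# Predict the merged row count: for each shared id, multiply its counts.
n1 <- table(dataset1$id)
n2 <- table(dataset2$id)
common <- intersect(names(n1), names(n2))
expected_rows <- sum(as.numeric(n1[common]) * as.numeric(n2[common]))
expected_rows
```

Running `anyDuplicated(originaldata$id)` on the real data would be a quick first check: a non-zero result means at least one ID is repeated.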

1 Answer


Try:

library(dplyr)
union(dataset1, dataset2)

`union()` should remove duplicated rows.
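
Note that dplyr's `union()` stacks rows and expects both data frames to have the same columns, so it may not fit a column-wise merge like the one in the question. As a hedged alternative in base R, duplicate IDs could be dropped before merging. This is a minimal sketch on made-up data, and it assumes that keeping the first row for each ID is acceptable, which may not hold for the real dataset.

```r
# Toy data (hypothetical): the id "A" appears twice in both data frames.
dataset1 <- data.frame(id = c("A", "A", "B"), var_x1 = c(TRUE, FALSE, TRUE))
dataset2 <- data.frame(id = c("A", "A", "B"), var_y1 = c(NA, TRUE, FALSE))

# Keep only the first row for each id (assumption: the first one is wanted).
dataset1_unique <- dataset1[!duplicated(dataset1$id), ]
dataset2_unique <- dataset2[!duplicated(dataset2$id), ]

# With unique ids on both sides, the merge yields one row per id.
newdata <- merge(dataset1_unique, dataset2_unique, by = "id")
nrow(newdata)  # 2
```

If the duplicated rows carry different information, inspecting them before dropping anything is safer than deduplicating blindly.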