0

I have to merge two data frames, Despesa_2021_Redand Membres_2021_Red, by NUMERO. However, the output I get is duplicating the values of the column codi_CNAE.

My code for this merge is

despesa_membres <- merge(x=Despesa_2021_Red, y=Membres_2021_Red,by="NUMERO", all=TRUE)

I've been struggling with this for a while and I can't figure it out what I'm doing wrong.

And here is a sample of my merged df:

structure(list(NUMERO = c(5, 5, 5, 6, 6, 6), codi_CNAE = c("12311", 
"12521", "12552", "01111", "01111", "01111"), grup_CNAE = c("Group 12. Altres béns i serveis", 
"Group 12. Altres béns i serveis", "Group 12. Altres béns i serveis", 
"Group 01. Aliments i begudes no alcohòliques", "Group 01. Aliments i begudes no alcohòliques", 
"Group 01. Aliments i begudes no alcohòliques"), despesatotal = c(850940.15, 
587286.98, 156106.29, 2279.22, 2279.22, 2279.22), quantitat = c(NA, 
NA, NA, 1226.04, 1226.04, 1226.04), despesamonetaria = c(850940.15, 
587286.98, 156106.29, 2279.22, 2279.22, 2279.22), sustentador = c(1, 
1, 1, 1, 0, 0), relacions = c("Sustentador principal", "Sustentador principal", 
"Sustentador principal", "Sustentador principal", "Parella", 
"Fill"), edat = c(69, 69, 69, 56, 56, 23), edatinterval = c("De 65 o més anys", 
"De 65 o més anys", "De 65 o més anys", "Entre 45 i 64 anys", 
"Entre 45 i 64 anys", "Entre 16 i 29 anys"), edatinterval_red = c("De 65 o més anys", 
"De 65 o més anys", "De 65 o més anys", "Entre 30 i 64 anys", 
"Entre 30 i 64 anys", "Menys de 29 anys"), sexe = structure(c(1, 
1, 1, 2, 1, 2), label = "sexe", format.stata = "%9.0g", labels = c(Home = 1, 
Dona = 2), class = c("haven_labelled", "vctrs_vctr", "double"
)), paisnaix = structure(c(1, 1, 1, 1, 1, 1), label = "paisnaix", format.stata = "%9.0g", labels = c(Espanya = 1, 
`Resta UE-27` = 2, `Resta UE` = 3, `Resta del món` = 4), class = c("haven_labelled", 
"vctrs_vctr", "double")), nacionalitat = structure(c(1, 1, 1, 
1, 1, 1), label = "nacionalitat", format.stata = "%9.0g", labels = c(Espanyola = 1, 
Estrangera = 2, `Espanyola i estrangera` = 3), class = c("haven_labelled", 
"vctrs_vctr", "double")), estatcivil = structure(c(3, 3, 3, 2, 
2, 1), label = "estatcivil", format.stata = "%9.0g", labels = c(Solter = 1, 
Casat = 2, Vidu = 3, Separat = 4, Divorciat = 5), class = c("haven_labelled", 
"vctrs_vctr", "double")), estudis = structure(c(4, 4, 4, 7, 4, 
7), label = "estudis", format.stata = "%9.0g", labels = c(Analfabet = 1, 
`Educació primària` = 2, `Educació secundària` = 3, `Batxillerat o FP bàsica` = 4, 
`FP superior` = 5, Grau = 6, Màster = 7, Doctorat = 8), class = c("haven_labelled", 
"vctrs_vctr", "double")), estudissimple = structure(c(3, 3, 3, 
4, 3, 4), label = "estudissimple", format.stata = "%9.0g", labels = c(`Inferior ESO` = 1, 
`ESO 1a etapa` = 2, `ESO 2a etapa` = 3, `Educació superior` = 4
), class = c("haven_labelled", "vctrs_vctr", "double")), activitat = structure(c(4, 
4, 4, 1, 1, 5), label = "activitat", format.stata = "%9.0g", labels = c(`Treballant almenys 1 hora` = 1, 
`Treball del que està absent` = 2, Aturat = 3, Jubilat = 4, Estudiant = 5, 
`Tasques de la llar` = 6, `Incapacitat laboral permanent` = 7, 
`Altra situació d'inactivitat econòmica` = 8), class = c("haven_labelled", 
"vctrs_vctr", "double")), actiu = c(0, 0, 0, 1, 1, 0), ocupat = c(0, 
0, 0, 1, 1, 0), jornada = c(0, 0, 0, 1, 1, 0), adult = c(1, 1, 
1, 1, 1, 0), ingressos = c(1, 1, 1, 1, 1, 0), ingressosmensuals = c(1206, 
1206, 1206, NA, NA, NA), ingressosmensualsinterval = c("Entre 1000 i menys de 1500 euros", 
"Entre 1000 i menys de 1500 euros", "Entre 1000 i menys de 1500 euros", 
"NA", "NA", "NA")), row.names = 390:395, class = "data.frame")
Maria
  • 173
  • 7
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. It's not clear what the problem is from your output. What values of `codi_CNAE` are bring repeated? What do you expect the output to look like? Are you sure there are no duplicated values of `NUMERO`? – MrFlick Nov 08 '22 at 15:55
  • Whenever you get more rows after a merge/join, that means the column(s) you are merging on are not unique. We can see in your sample output the duplicated values in `NUMERO = c(5, 5, 5, 6, 6, 6)`. If, say, `Despesa_2021_Red` has 2 rows with `NUMERO = 6` and `Membres_2021_Red` has 3 rows with `NUMERO = 6`, then the result will match every pair, so you will have 2 * 3 = 6 rows with `NUMERO = 6` in the result. – Gregor Thomas Nov 08 '22 at 16:47
  • I'm not sure what you want to do about it - maybe you want to de-duplicate one of the inputs. Maybe you want to merge on an additional column so it is clear which rows to match. Maybe that additional column needs to be created. Maybe, now that you understand what is happening, you realize it is what you want. If you need more help, please provide sample **input**, just a few rows from each data frame with (matching) repeated `NUMERO` values along with a description of what you want the output to be. – Gregor Thomas Nov 08 '22 at 16:49

0 Answers0