Filling in the values of a column based on matching strings from the column of another dataset

Question

I am working with the following two datasets :

will_can

structure(list(will_can.REGION = c("AB", "B", "B", "B", "BB", 
"BB", "BD", "BH", "BH", "BH", "BR", "BS", "BS", "BT", "BT", "CF", 
"CF", "CM", "CO", "CV", "CV", "CV", "CW", "DA", "DA", "DD", "DE", 
"DE", "DG", "DG", "DG", "DG", "DL", "DN", "DT", "E", "E", "E", 
"EH", "EH", "EH", "EH", "EH", "EH", "EH", "EX", "EX", "EX", "FK", 
"FK", "FY", "G", "G", "G", "GL", "GL", "HA", "HD", "HD", "IV", 
"KA", "KA", "KA", "KA", "KA", "KA", "KA", "KA", "KA", "KA", "KA", 
"KA", "KA", "KA", "KA", "KA", "KA", "KT", "KY", "KY", "KY", "L", 
"L", "L", "LA", "LA", "LE", "LE", "M", "M", "ME", "ME", "MK", 
"ML", "N", "N", "N", "NE", "NG", "NN", "NN", "NR", "NW", "OL", 
"OX", "OX", "PH", "PO", "PR", "RG", "RH", "RM", "RM", "S", "S", 
"S", "S", "SA", "SE", "SE", "SE", "SE", "SE", "SG", "SL", "SN", 
"SN", "SO", "SO", "SO", "SS", "ST", "ST", "ST", "ST", "SW", "SW", 
"SW", "SW", "SY", "SY", "SY", "TA", "TD", "TD", "TN", "TW", "UB", 
"UB", "W", "W", "W", "W", "WA", "WC", "WD")), class = "data.frame", row.names = c(NA, 
-156L))

will_can_region_norm

structure(list(norm = c(67.3112073766083, 0, 62.9924341677094, 
0, 134.940019161483, 86.0271073135687, 233.710968710152, 0, 0, 
136.210220315945, 72.0106074505199, 54.9624828839958, 0, 0, 46.5061888459603, 
0, 51.9149234846709, 85.3970454501009, 0, 0, 141.438961332615, 
122.50716299382, 197.887432921107, 96.646567080111, 108.996678489718, 
873.779493880704, 0, 109.106806944561, 56.7421763178016, 249.99781251914, 
0, 106.993398828272, 0, 182.997053590583, 0, 225.716259764203, 
217.655353412983, 98.8344746903195, 70.3435951664196, 106.870878390986, 
0, 0, 113.255439262354, 226.344150395729, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 92.5698187029358, 0, 1159.88543061088, 59.5746039659052, 
0, 217.977759293264, 88.627745595238, 155.299651064979, 0, 70.3301130229532, 
0, 0, 0, 0, 36.166169734453, 162.12380892704, 74.7710230881704, 
112.29824076945, 120.249189991435, 25.6209421071498, 36.7120335621411, 
115.238964414265, 0, 50.4621322067494, 59.9490876378327, 82.9160720202368, 
132.342362545417, 0, 0, 209.987774511768, 0, 45.0104437732687, 
59.5244437425851, 54.7420581590574, 77.921490980977, 132.545922191567, 
100.083647410414, 51.5757713324224, 102.602449571922, 98.8984492920948, 
0, 129.885834248271, 0, 189.332549749021, 149.846130500895, 0, 
0, 73.4653456617979, 220.103517986062, 111.317004279081, 375.711503660056, 
156.229153172374, 760.35739839154, 0, 83.1515916711375, 0, 0, 
0, 73.5483180088058, 269.518568414391, 102.141462145838, 55.2886923953334, 
151.949727736478, 148.297412239816, 0, 0, 0, 0, 0, 0, 0), REGION = c("AB", 
"AL", "B", "BA", "BB", "BD", "BH", "BL", "BN", "BR", "BS", "BT", 
"CA", "CB", "CF", "CH", "CM", "CO", "CR", "CT", "CV", "CW", "DA", 
"DD", "DE", "DG", "DH", "DL", "DN", "DT", "DY", "E", "EC", "EH", 
"EN", "EX", "FK", "FY", "G", "GL", "GU", "GY", "HA", "HD", "HG", 
"HP", "HR", "HS", "HU", "HX", "IG", "IM", "IP", "IV", "JE", "KA", 
"KT", "KW", "KY", "L", "LA", "LD", "LE", "LL", "LN", "LS", "LU", 
"M", "ME", "MK", "ML", "N", "NE", "NG", "NN", "NP", "NR", "NW", 
"OL", "OX", "PA", "PE", "PH", "PL", "PO", "PR", "RG", "RH", "RM", 
"S", "SA", "SE", "SG", "SK", "SL", "SM", "SN", "SO", "SP", "SR", 
"SS", "ST", "SW", "SY", "TA", "TD", "TF", "TN", "TQ", "TR", "TS", 
"TW", "UB", "W", "WA", "WC", "WD", "WF", "WN", "WR", "WS", "WV", 
"YO", "ZE")), row.names = c(NA, -124L), class = "data.frame")

I am trying to add a new column, will_can$norm, which would contain values of will_can_region_norm$norm based on matching values of the variable "REGION" which is the same in both datasets. So gaps from the second dataset to the first would be automatically filled based on matching strings of REGION

Based on another question in the forum I tried the following function:

will_can2 <- merge(will_can, will_can_region_norm[,"norm"], by = "REGION", all=TRUE)

But I get the following error:

Error in fix.by(by.y, y) : 'by' must specify a unique correct column [translated from French]

Is there something I'm missing here? Would be grateful for some help !

Cameron

score 1 · Accepted Answer · answered Feb 15 '21 at 10:45

1

For your merge(will_can, will_can_region_norm[,"norm"], by = "REGION", all=TRUE) command to work, both data.frames would need a column called REGION. In your example:

will_can doesn't have this column, but it does have one called will_can.REGION.
You've extracted a single column from will_can_norm called norm, and tried to merge based on that single column. Unfortunately, the merge() command never sees the REGION column of will_can_norm.

In your case, try something like

merge(will_can, will_can_region_norm, by.x = "will_can.REGION", by.y="REGION", all=TRUE)

answered Feb 15 '21 at 10:45

Jason

2,507
20
25

Thank you, it worked and I then just had to remove rows containing ```NA```! – Cameron Feb 15 '21 at 11:21
1

@Cameron, regarding the `NA` values, have you tried playing with the parameters `all`, `all.x`, `all.y`? – Jason Feb 17 '21 at 04:11
No I didn't, I added a separate function to remove them : `will_can2 <- will_can2[complete.cases(will_can2), ]` – Cameron Feb 18 '21 at 08:23

Filling in the values of a column based on matching strings from the column of another dataset

1 Answers1