I have a data-set with 25 columns and over 600k observations, one column of which is named 'destination'. This column holds destinations such as Singapore written in many different ways, e.g. Singapore appears as SINGAPORE, S'PORE, SINGPORE etc., in 61 different variants. I intend to standardize all the values corresponding to SINGAPORE and assign a code to this destination for further analysis.
I have tried using grep() and gsub() to identify destinations starting with ZHO:
NOV1151Sub <- NOV1151[grep("^ZHO", NOV1151$destination), ]
I have also tried the 'stringr' package, without success.
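To make the subsetting attempt reproducible, here is a minimal sketch on a toy data frame. The object `toy` and its values are made up for illustration and are not taken from NOV1151. It assumes the goal is to keep whole rows: `grepl()` returns a logical vector, and the comma after it selects rows rather than columns.

```r
# Toy stand-in for NOV1151 (hypothetical values, for illustration only)
toy <- data.frame(
  NAME = c("a", "b", "c"),
  destination = c("ZHOUSHAN", "XIAMEN", "ZHOSHAN"),
  stringsAsFactors = FALSE
)

# "^ZHO" anchors the match to the start of the string;
# the trailing comma subsets rows and keeps every column
toy_sub <- toy[grepl("^ZHO", toy$destination), ]
print(toy_sub)
```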
What I want is to identify a substring in a big data-set, e.g. 'PORE' is common to all values corresponding to Singapore, and replace it with 'SGR' for further analysis. The table looks like this:
NAME destination
a S'PORE
b SINPORE
c SINGAPORE
d XIAM
e XIAMIN
f XIAMEN
g YANTIAN
h YANTAI
i ZHANGJIANG
j ZHANGJIAGANG
k RTD
l ROTTER
Desired output:
NAME destination
a SINGAPORE
b SINGAPORE
c SINGAPORE
d XIAMEN
e XIAMEN
f XIAMEN
g YANTIAN
h YANTAI
i ZHANGJIAGANG
j ZHANGJIAGANG
k ROTTERDAM
l ROTTERDAM
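The transformation from the first table to the desired output can be sketched as follows. This is only an illustration under assumptions: the regular expressions in `rules` are guesses that happen to fit the twelve sample rows, and the real data with its 61 variants of Singapore would likely need more careful patterns. The names `dest` and `rules` are hypothetical.

```r
# Sample table from above, rebuilt as a data frame
dest <- data.frame(
  NAME = letters[1:12],
  destination = c("S'PORE", "SINPORE", "SINGAPORE", "XIAM", "XIAMIN", "XIAMEN",
                  "YANTIAN", "YANTAI", "ZHANGJIANG", "ZHANGJIAGANG", "RTD", "ROTTER"),
  stringsAsFactors = FALSE
)

# Named vector: names are regular expressions, values are the standard spellings
# (the patterns are assumptions chosen to fit the sample rows only)
rules <- c(
  "PORE"          = "SINGAPORE",
  "^XIAM"         = "XIAMEN",
  "^ZHANGJ"       = "ZHANGJIAGANG",
  "^(RTD|ROTTER)" = "ROTTERDAM"
)

# Replace the whole value wherever a pattern matches;
# values matching no pattern (YANTIAN, YANTAI) are left untouched
for (pat in names(rules)) {
  hit <- grepl(pat, dest$destination)
  dest$destination[hit] <- rules[[pat]]
}
print(dest)
```

Replacing the whole value via a logical index, rather than substituting inside the string with gsub(), avoids leaving fragments of the original spelling behind.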
Once the syntax for changing a pattern is settled, how can I write a function that applies the same syntax to a data-set with a different name? For example, I want to change any value containing the sequence 'ZOU' to 'ZHOUSHAN', and there are many other patterns like this one.
To change the pattern in the destination column of the NOV1151 data-set, I used the following code:
NOV1151$destination <- gsub(".*ZOU.*", "ZHOUSHAN", NOV1151$destination)
To write the function, I looked at the source code of gsub() and of str_replace from the stringr package and wrote code to replicate their effects. I named my function Gen. When I used it to change the same pattern in the MAY214 data-set, I got the following error:
Error in Gen(MAY214) : argument "x" is missing, with no default
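This error message is what R reports when a function uses an argument that the caller did not supply and that has no default. The actual definition of Gen is not shown, so the sketch below is a hypothetical reconstruction: `Gen_sketch` assumes two required arguments, and `standardize_destination` is one possible reusable form that takes the data-set as its first argument. `MAY214_toy` is made-up data standing in for MAY214.

```r
# Made-up stand-in for the MAY214 data-set
MAY214_toy <- data.frame(
  NAME = c("a", "b"),
  destination = c("ZOUSHAN", "XIAMEN"),
  stringsAsFactors = FALSE
)

# Hypothetical reconstruction of the failing function: two required arguments
Gen_sketch <- function(pattern, x) {
  gsub(pattern, "ZHOUSHAN", x)
}

# Called with one argument, the data frame is taken as `pattern`
# and `x` is never supplied, so R stops when `x` is first used
err <- tryCatch(Gen_sketch(MAY214_toy), error = function(e) conditionMessage(e))
print(err)

# One possible reusable version: data-set first, column name as a parameter
standardize_destination <- function(data, pattern, replacement,
                                    column = "destination") {
  values <- as.character(data[[column]])  # guard against factor columns
  values[grepl(pattern, values)] <- replacement
  data[[column]] <- values
  data
}

MAY214_toy <- standardize_destination(MAY214_toy, "ZOU", "ZHOUSHAN")
print(MAY214_toy)
```

Because the function returns the modified data frame, the same call works for any data-set name, e.g. `NOV1151 <- standardize_destination(NOV1151, "ZOU", "ZHOUSHAN")`.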
Should I first make a reference .CSV file and then use it to change the patterns in any data-set, or can this be done in a better manner?
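For concreteness, the reference-file idea could look roughly like the sketch below. Everything here is an assumption for illustration: the column names `variant`, `standard` and `code`, the helper `apply_reference`, and the code RTM are placeholders (only SGR comes from the description above). The reference table is written to a temporary file so the example is self-contained.

```r
# Hypothetical reference table; in practice this would be a maintained .CSV file
ref_path <- tempfile(fileext = ".csv")
writeLines(c(
  "variant,standard,code",
  "S'PORE,SINGAPORE,SGR",
  "SINPORE,SINGAPORE,SGR",
  "SINGAPORE,SINGAPORE,SGR",
  "RTD,ROTTERDAM,RTM",
  "ROTTER,ROTTERDAM,RTM"
), ref_path)
ref <- read.csv(ref_path, stringsAsFactors = FALSE)

# Look up each destination by exact match against the known variants
apply_reference <- function(data, ref, column = "destination") {
  idx <- match(data[[column]], ref$variant)
  known <- !is.na(idx)
  data$code <- NA_character_                       # unknown variants get NA
  data$code[known] <- ref$code[idx[known]]
  data[[column]][known] <- ref$standard[idx[known]]
  data
}

toy <- data.frame(
  NAME = c("a", "k", "g"),
  destination = c("S'PORE", "RTD", "YANTIAN"),
  stringsAsFactors = FALSE
)
toy <- apply_reference(toy, ref)
print(toy)
```

Exact matching means every variant has to be listed once in the file, but rows left with an NA code show immediately which spellings are still missing from the reference.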