I have a data set with 25 columns and over 600k observations. One column, 'destination', holds destination names that are written inconsistently: Singapore, for example, appears as SINGAPORE, S'PORE, SINGPORE, etc., in 61 different variants. I want to standardize all the values that correspond to SINGAPORE and assign a code to that destination for further analysis.

I have tried using grep() and gsub() to identify destinations starting with ZHO:

NOV1151Sub <- NOV1151[grep("^ZHO", NOV1151$destination), ]

Also, I have tried using the 'stringr' package with no effect.

What I want is to identify a substring across a large data set (e.g. 'PORE' is common to all values corresponding to Singapore) and replace the matching values with 'SGR' for further analysis. The table looks like this:

NAME  destination
a     S'PORE
b     SINPORE
c     SINGAPORE
d     XIAM
e     XIAMIN
f     XIAMEN
g     YANTIAN
h     YANTAI
i     ZHANGJIANG
j     ZHANGJIAGANG
k     RTD
l     ROTTER

desired output

NAME  destination
a     SINGAPORE 
b     SINGAPORE
c     SINGAPORE
d     XIAMEN
e     XIAMEN
f     XIAMEN
g     YANTIAN
h     YANTAI
i     ZHANGJIAGANG
j     ZHANGJIAGANG
k     ROTTERDAM
l     ROTTERDAM

Once the syntax for replacing a pattern is fixed, how can I write a function that applies the same replacement to a data set with a different name? For example, I want to change any value containing the sequence 'ZOU' to 'ZHOUSHAN', and many others like this one.

To change the pattern in the destination column of the NOV1151 data set, I used the following code:

NOV1151$destination <- gsub(".*ZOU.*", "ZHOUSHAN", NOV1151$destination)

To write the function, I looked at the source code of gsub() and of str_replace() from the stringr package and wrote code to replicate their effect. When I tried to change the same pattern in the MAY214 data set with my function, which I named Gen, I got the following error:

Error in Gen(MAY214) : argument "x" is missing, with no default

Should I first create a reference .csv file and use it to change the patterns in any data set, or can this be done in a better way?

marine8115

2 Answers

You may find some help in the CRAN package "stringdist". Its function stringdistmatrix() gives a measure of the difference between the elements of a vector of strings. For the data set you have provided, you can get close to the specified result by combining elements that are within a distance of four of each other into the same group, using the "osa" metric. The longest or the most frequent string in each group could then be assigned as the group name. How much manual attention this needs, and whether the outcome is acceptable in the real world, will require some careful consideration.
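
As a rough sketch of the distance-matrix idea, the snippet below uses base R's adist() (plain Levenshtein distance) as a stand-in for stringdistmatrix() with method "osa", so that it runs without installing anything; the main difference is that OSA also counts a swap of two adjacent characters as a single edit. The helper near_matches() and the cutoff of 2 are assumptions made for illustration and are not part of either package.

```r
# Distinct spellings taken from the sample table in the question.
dest <- c("S'PORE", "SINPORE", "SINGAPORE", "XIAM", "XIAMIN", "XIAMEN",
          "YANTIAN", "YANTAI", "ZHANGJIANG", "ZHANGJIAGANG", "RTD", "ROTTER")

# Pairwise edit distances (base R stand-in for stringdist::stringdistmatrix).
d <- adist(dest)
dimnames(d) <- list(dest, dest)

# Hypothetical helper: for each spelling, list the other spellings
# that are at most `max_dist` edits away.
near_matches <- function(d, max_dist) {
  lapply(setNames(rownames(d), rownames(d)), function(s) {
    hits <- d[s, ] <= max_dist
    hits[s] <- FALSE               # a spelling is not its own candidate
    names(hits)[hits]
  })
}

candidates <- near_matches(d, max_dist = 2)

candidates[["XIAMEN"]]    # spellings close to XIAMEN
candidates[["YANTIAN"]]   # includes YANTAI, which must stay a separate port

# Abbreviations are far from their full name in edit-distance terms.
adist("RTD", "ROTTERDAM")
```

On this sample, XIAM and XIAMIN both fall within two edits of XIAMEN, but YANTIAN and YANTAI are also only two edits apart even though they have to stay separate, and RTD is six edits away from ROTTERDAM. Cases like these are where the manual review comes in.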

Rick

NOV1151$destination <- gsub(".*PORE.*", "SGR", NOV1151$destination) also works fine. Do watch out for unintended matches when using this approach. For example, for INCHEON, NOV1151$destination <- gsub(".*INCH.*", "INCHEON", NOV1151$destination) will also convert TIANJINCHINA, because that text contains the sequence 'INCH' too. Check the lookup table and use filtering in R to avoid such errors.
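
One possible way to reuse the same replacements on data sets with different names is to keep the patterns in a lookup table and pass the data set to a function as an argument. The sketch below is only an illustration: the function name standardize_destinations(), the patterns in lookup, and the sample rows (including the INCHON spelling and the MAY214 rows) are assumptions, not a vetted list. The same table could equally be read from a reference .csv file with read.csv(). Anchoring a pattern with ^ or $ is one way to avoid the TIANJINCHINA problem.

```r
# An unanchored pattern matches anywhere in the string.
grepl("INCH", "TIANJINCHINA")    # TRUE
grepl("^INCH", "TIANJINCHINA")   # FALSE

# Hypothetical lookup table: one regular expression per canonical name.
lookup <- data.frame(
  pattern   = c("PORE$", "^XIAM", "^ZHANGJIA", "^(RTD|ROTTER)", "^INCH", "ZOU"),
  canonical = c("SINGAPORE", "XIAMEN", "ZHANGJIAGANG", "ROTTERDAM",
                "INCHEON", "ZHOUSHAN"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: works on any data frame that has the given column.
standardize_destinations <- function(data, lookup, column = "destination") {
  original <- as.character(data[[column]])
  cleaned  <- original
  for (i in seq_len(nrow(lookup))) {
    # Match against the original values so earlier replacements cannot
    # trigger later patterns; if two patterns match, the later row wins.
    hit <- grepl(lookup$pattern[i], original)
    cleaned[hit] <- lookup$canonical[i]
  }
  data[[column]] <- cleaned
  data
}

# Sample data from the question, plus two made-up rows for the INCH case.
NOV1151 <- data.frame(
  NAME = letters[1:14],
  destination = c("S'PORE", "SINPORE", "SINGAPORE", "XIAM", "XIAMIN", "XIAMEN",
                  "YANTIAN", "YANTAI", "ZHANGJIANG", "ZHANGJIAGANG",
                  "RTD", "ROTTER", "INCHON", "TIANJINCHINA"),
  stringsAsFactors = FALSE
)
# Made-up second data set to show the function is not tied to one name.
MAY214 <- data.frame(NAME = c("x", "y"),
                     destination = c("ZOUSHAN", "SINGPORE"),
                     stringsAsFactors = FALSE)

NOV1151_clean <- standardize_destinations(NOV1151, lookup)
MAY214_clean  <- standardize_destinations(MAY214, lookup)
```

Values that match no pattern, such as YANTIAN, YANTAI and TIANJINCHINA here, are left unchanged, so they can be inspected afterwards and the lookup table extended as needed.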

The answer has been provided by Pierre Lafortune

marine8115