1

I am dealing with big data and I would like to replace column values by grep.

Here is a simple example:

data=data.frame(state=c("AlAbama","Alaaska","Arizoona"),
                    gender=c("male","female","female"))
datalist=data.frame(state=c("Alabama","Alaska","Arizona"))

In the data called "data", I have the states names written in a wrong way and I would like to replace them by the write names in datalist. So I would like to replace column values in "data" with column values in "datalist" using grep.

I have tried this:

data[grep(data[,"state"],datalist[,"state"])]

And this:

for (u in datalist$state){
data[grep(u,datalist$state)]
}

But it doesn't work.

Do you have any ideas how to solve this problem?

Sincerely yours, Mily

oguz ismail
  • 1
  • 16
  • 47
  • 69
Mily
  • 331
  • 1
  • 3
  • 18

2 Answers2

0

There is multiple way to solve this, but it may be time consuming depending on the kind of cleaning you want to do.

First, grep only replace a set of character. So it is alright only if you have repeting pattern happening. For exemple, if all the "a" are doubled, you could use grep to replace "aa" by "a".

If you have a limited number of states entered in the wrong way, you could manually create a lookup table with the incorrect values associated with the correct value and simply do a merge. Or do for each states entered in the wrong way a grep step, but it will be longer.

If you have a high number of pattern to clean, you could use open refine, which have a half manual cluster functionality and different metrics implemented to easily cluster your states.

YCR
  • 3,794
  • 3
  • 25
  • 29
0

Quick try as an answer.

library(stringdist)

data1 <- data.frame(
  state = c("AlAbama", "Alaaska", "Arizoona", "NY"),
  gender = c("male", "female", "female", "unkn")
)
datalist <- data.frame(state = c("Arizona", "Alabama", "Alaska"))

dist_m <- data.frame(stringdistmatrix(data1$state, datalist$state))

dist_m$minID <- apply(dist_m, 1, which.min)

data1$state2 <- datalist$state[dist_m$minID]

And the result:

     state gender  state2
1  AlAbama   male Alabama
2  Alaaska female  Alaska
3 Arizoona female Arizona
4       NY   unkn  Alaska

Of course you will need to be careful with this, but as an ad-hoc solution should be fine.

You should also read about method and weight parameters of the stringdist function.

m-dz
  • 2,342
  • 17
  • 29
  • Thanks :) Have a nice day. – Mily May 11 '16 at 09:05
  • You too! Let me know if it worked as expected. It might be a good idea to use a lookup with full state names and their abbreviations and then, as a second step, changing the abbreviations to full names. – m-dz May 11 '16 at 09:13