Finding strings that differ with at most one letter from a given string in SAS with PROC SQL

Question

First some context. I am using proc sql in SAS, and need to fetch all the entries in a data set (with a couple of million entries) that have variable "Name" equal to (let's say) "Massachusetts". Of course, since the data was once manually entered by humans, close to all conceivable spelling errors occur ("Amssachusetts", "Kassachusetts" etc.).

I have found that few entries get more than two characters wrong, so the code

Name like "__ssachusetts" OR Name like "_a_sachusetts" OR ... OR Name like "Massachuset__"

would select the entries I am looking for. However, I am hoping that there must be a more convenient way to write

Name that differs by at most 2 characters from "Massachusetts";

Is there? Or is there some other strategy for fetching these entries? I tried searching both stackoverflow and the web but was unsuccesful. I am also a relative beginner with both SQL and SAS.

Some additional information: The database is not in English (and the actual string is not "Massachusetts") so using SOUNDEX is not really feasible (if it ever were).

Thanks in advance.

(Edit: Improved the title)

what's your DB engine? e.g. base SAS, DB2 etc? – trickwallett Apr 26 '11 at 13:25 — trickwallett, Apr 26 '11 at 13:25

cmjohns · Accepted Answer · 2011-04-26T19:14:19.150

15

SAS has built-in functions COMPGED and COMPLEV to compute distances between strings. Here is an example that shows how to select just those with a Levenshtein edit distance of less than or equal to 2.

data typo;
input name $20.;
datalines;
massachusetts
masachusets
mssachusetts
nassachusets
nassachussets
massachusett
;

proc sql;
  select name from typo
  where complev(name, "massachusetts") <= 2;
quit;

edited Apr 26 '11 at 19:14

answered Apr 26 '11 at 16:22

cmjohns

4,465
17
21

Wonderful. This appears to be spot on. Ill check it first time in the morning (and then subsequently click the check mark :)). – Har Apr 26 '11 at 21:22

score 4 · Answer 2 · answered Apr 26 '11 at 13:35

4

There are other phonetic algorithms like Hamming distance that should work better. You can search on google for implementation of this algorithm for your specific DB engine.

answered Apr 26 '11 at 13:35

shrutyzet

529
3
11

Thanks. I did not know that it was called the Hamming distance. – Har Apr 26 '11 at 13:44
+1 for another solution. I was curious about the difference between Hamming Distance and [Levenshtein distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) that is cited in the accepted answer. Basically, I don't think Hamming Distance is "better", as implied by this answer, because Levenshtein only treats a transpose as a cost of 1, not 2. So, in the OPs situation, if the input had two sets of transposed letters it would have a Levenshtein score of 2, but a Hamming score of 4. I would only consider a transpose as a single error, so I think Levenshtein is more accurate. – Matt Klein Nov 24 '15 at 22:39

score 3 · Answer 3 · edited May 23 '17 at 12:23

3

What you are looking for is "Approximate string matching". For that one can use "Levenshtein distance computing algorithm". I am not sure, but hope that this answer will help

edited May 23 '17 at 12:23

Community

1
1

answered Apr 26 '11 at 13:41

Draco Ater

20,820
8
62
86

score 2 · Answer 4 · answered Apr 26 '11 at 13:36

2

You could implement a stored function of this type (Oracle syntax, transform to your RDBMS):

CREATE FUNCTION distance(one VARCHAR2, two VARCHAR2) RETURN NUMBER IS
DETERMINISTIC
BEGIN
  -- do some comparison here
END distance;

And then use it in SQL:

SELECT * FROM table WHERE distance(name, 'Massachusetts') <= 2

Of course, these things tend to be quite slow...

answered Apr 26 '11 at 13:36

Lukas Eder

211,314
129
689
1,509

Agreed. That it is slow is fine (after all, I am already doing that and time is not an issue, especially compared to programming time). I will have to look further into the possibility of defining SQL functions from SAS. Thanks. – Har Apr 26 '11 at 13:43

score 0 · Answer 5 · edited May 23 '17 at 12:15

I know this is four years too late but since it might also give ideas to others who are searching this thread: What you're considering is a semantic layered design you would need to implement some conditional logic for these different text comparisons, using Lenvenschtien distances like the Jaro-Winkler for comparing text of differing lengths and Hamming for those of the same length for which you suppose simple text trans-positioning. This is nothing new these days with all of the various text mining programs out there. Here is a post which is very good in my view; Jaro-Winkler string comparison function in SAS

Finding strings that differ with at most one letter from a given string in SAS with PROC SQL

5 Answers5