362

I'm trying to search for the word Gadaffi, which can be spelled in many different ways. What's the best regular expression to search for this?

This is a list of 30 variants:

Gadaffi
Gadafi
Gadafy
Gaddafi
Gaddafy
Gaddhafi
Gadhafi
Gathafi
Ghadaffi
Ghadafi
Ghaddafi
Ghaddafy
Gheddafi
Kadaffi
Kadafi
Kaddafi
Kadhafi
Kazzafi
Khadaffy
Khadafy
Khaddafi
Qadafi
Qaddafi
Qadhafi
Qadhdhafi
Qadthafi
Qathafi
Quathafi
Qudhafi
Kad'afi

My best attempt so far is:

\b[KG]h?add?af?fi$\b

But I still seem to be missing some variants. Any suggestions?

SiggyF
  • 22,088
  • 8
  • 43
  • 57
  • 2
    How do you know that you are missing some journals? – heldt Mar 19 '11 at 22:15
  • 8
    Which ones are you missing? And where are you searching, is there a web-search with regex? – Czechnology Mar 19 '11 at 22:15
  • 1
    Every time I read a new journal I find a new spelling. For example NY Times uses Qaddafi. – SiggyF Mar 19 '11 at 22:19
  • 43
    There are always new journals published, so if they keep writing about Gadaffi you'll tend towards `.+` being the only valid regular expression. – moinudin Mar 19 '11 at 22:22
  • 30
    I found that this picture helps with the different spellings: http://upload.wikimedia.org/math/6/1/f/61f34aa25871e9546b6a11243e1bed31.png – KLee1 Mar 19 '11 at 22:54
  • 24
    As usual, Lisp implemented this first - http://www.foldr.org/~michaelw/projects/regex/regexp-test-suite.lisp (scroll about half-way down) – Daniel S. Sterling Mar 21 '11 at 20:10
  • The difficulty of translation (thus reason for so many variations) is a consequence of the fact that the name is a direct transliteration from Arabic. – CRice Mar 24 '11 at 05:41
  • @yesterday, I'm not familiar with arabic but why are there so many transliterations into one language (english)? – Czechnology Mar 25 '11 at 21:50
  • Try the same techniques for finding elephants. Just find Gadaffi(s) in Africa instead. Experienced programmers will need a Gadaffi look-alike and mathematicians won't be popular in Africa anymore. Using a regular expression sounds like the worst of all ideas since, you know, now you have two problems. *scnr* – Meinersbur Mar 26 '11 at 16:49
  • 1
    Gheddafi is also accepted by the expression below `$ echo Gheddafi | pcregrep --color "\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b"` – SiggyF Apr 06 '11 at 13:41
  • You may not believe me, but I independently just thought of this today and thought it would a neat regex question. Then I find you beat me to it months ago. Th next step would be machine learning which condenses all possible spellings to the most compact regex (not sure if it should exclude spurious alternatives). – smci Aug 13 '11 at 01:39
  • 2
    Good thing there isn't some rule like "the number of d's and f's should be equal", so we can get away without writing a Context-free Grammar :) –  Aug 23 '11 at 20:24
  • 2
    Considering the world is usually split into countries, you would need to recursively search each country, or at least have a loop in the expression to search each sub-population. I don't think regex is the right way to find Gaddaffi. It's not like he's hiding in a HTML document. – ssube Aug 23 '11 at 22:36
  • 1
    @peachykeen he *is* a very wiley fox, he could be hiding *anywhere!* – glenatron Aug 23 '11 at 22:54
  • 7
    @Daniel Sterling: actually, the Khadafy test is part of the GNU grep testsuite since the initial commit to RCS (Tue Nov 3 21:38:52 1998 +0000), and is probably even older than that! – Paolo Bonzini Aug 24 '11 at 08:38
  • http://www.sporcle.com/games/SporcleEXP/Gadaffi – Moe Aug 24 '11 at 13:12
  • 1
    Rex produces random result set members given a regex. Here's one for your example: http://rise4fun.com/Rex/ELh – miku Aug 01 '12 at 23:40
  • I would not only try regexes but also [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance) – Sjuul Janssen Aug 24 '11 at 12:54
  • If you found a list online, you don't even need regex! Just copy-paste the list, and using some string processing, make it into a list. Use a programming language to see if the words in the article are in the list. –  Sep 25 '15 at 15:33
  • Become familiar with precision and recall. Which one is more important will determine if a regex or explicit list is more apropriate. https://en.wikipedia.org/wiki/Precision_and_recall – Trenton Mar 30 '16 at 05:31

15 Answers15

278

Easy... (Qadaffi|Khadafy|Qadafi|...)... it's self-documented, maintainable, and assuming your regexp engine actually compiles regular expressions (rather than interpreting them), it will compile to the same DFA that a more obfuscated solution would.

Writing compact regular expressions is like using short variable names to speed up a program. It only helps if your compiler is brain-dead.

Chris Pacejo
  • 2,817
  • 4
  • 17
  • 17
  • 24
    Great answer! People use regular expressions far more often than they care about how they actually work. – Thomas Ahle Mar 28 '11 at 14:29
  • 3
    I really like the simplicity of this solution too, but I'm surprised that this will compile down to the same DFA. Do you have a link that talks about this? Intuitively this seems like it might be less efficient than the previously crafted regex or the answer below which suggests using the Regexp::Assemble perl module on the same list of of or'd names. – Rian Sanderson Aug 25 '11 at 00:08
  • 6
    -1 The whole point of a regex is to reduce what can often be -- as it is in this case -- a very long list of alternatives to relatively short formula. The result can often execute faster than doing what is essentially an unoptimized exhaustive search. – martineau Aug 25 '11 at 15:50
  • 7
    You're right, that the point of regexes is to provide a compact, clear representation for a large set of values. But the basic concept is to present a regex and say "anything that matches this is good." That is, it assumes you have the freedom to include anything systematic. Here, we have the opposite situation: the variant spellings (and the variations that never appear) are only barely this side of 'utterly random.' The elaborate attempts at "compact" get very low points for "clear"! – jackr Oct 20 '11 at 23:14
  • The major regex flavors do not compile to DFA. – Qtax May 18 '13 at 13:31
  • 1
    Also check out the Aho-Corasick algorithm, which is optimal for simultaneously string searching: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm – Thomas Ahle Jun 12 '14 at 16:07
  • Is there any way to see what the engine makes out of it? – phk Oct 09 '16 at 11:48
139

\b[KGQ]h?add?h?af?fi\b

Arabic transcription is (Wiki says) "Qaḏḏāfī", so maybe adding a Q. And one H ("Gadhafi", as the article (see below) mentions).

Btw, why is there a $ at the end of the regex?


Btw, nice article on the topic:

Gaddafi, Kadafi, or Qaddafi? Why is the Libyan leader’s name spelled so many different ways?.


EDIT

To match all the names in the article you've mentioned later, this should match them all. Let's just hope it won't match a lot of other stuff :D

\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b
Lucas
  • 523
  • 2
  • 10
  • 20
Czechnology
  • 14,832
  • 10
  • 62
  • 88
  • The $ is wrong, I was matching end of lines first, forgot to remove it. – SiggyF Mar 19 '11 at 22:23
  • Does `d` also match a ḏ? – SiggyF Mar 19 '11 at 22:30
  • 2
    @DiggyF, no, I just thought that if the arabic transcription says `Qaḏḏāfī`, the regex should check for `Qaddafi` too. If you want to look for the arabic transcription too, just search for that one - I don't think there are more variants of the arabic transcription, only of the english transcriptions. – Czechnology Mar 19 '11 at 22:34
  • @DiggyF, I've editted in a longer regex which matches all of the names in the article you've posted (except the two with `?` instead of letters). Might be an overkill though. – Czechnology Mar 19 '11 at 22:41
  • Yep I tested it, this got them all. I'll add the list of the 30 variants also to the question. Thanks very much. – SiggyF Mar 19 '11 at 22:48
  • 2
    This also matches 'Quuzzafi' and a bunch of other false positives, though I suppose that in searching through news reports etc. that won't matter much. – ben w Sep 21 '11 at 22:19
  • @Czechnology - perhaps you could add this diagram: https://www.debuggex.com/r/V69MyxmjYKi5YxX7 – Billy Moon Apr 03 '14 at 15:58
45

One interesting thing to note from your list of potential spellings is that there's only 3 Soundex values for the contained list (if you ignore the outlier 'Kazzafi')

G310, K310, Q310

Now, there are false positives in there ('Godby' also is G310), but by combining the limited metaphone hits as well, you can eliminate them.

<?
$soundexMatch = array('G310','K310','Q310');
$metaphoneMatch = array('KTF','KTHF','FTF','KHTF','K0F');

$text = "This is a big glob of text about Mr. Gaddafi. Even using compound-Khadafy terms in here, then we might find Mr Qudhafi to be matched fairly well. For example even with apostrophes sprinkled randomly like in Kad'afi, you won't find false positives matched like godfrey, or godby, or even kabbadi";

$wordArray = preg_split('/[\s,.;-]+/',$text);
foreach ($wordArray as $item){
    $rate = in_array(soundex($item),$soundexMatch) + in_array(metaphone($item),$metaphoneMatch);
    if ($rate > 1){
        $matches[] = $item;
    }
}
$pattern = implode("|",$matches);
$text = preg_replace("/($pattern)/","<b>$1</b>",$text);
echo $text;
?>

A few tweaks, and lets say some cyrillic transliteration, and you'll have a fairly robust solution.

tomwalsham
  • 781
  • 4
  • 4
  • 2
    Please note, soundex is specialized to English, there do exist other phonetic algorithms for other languages with different pronunciation rules – Incognito Mar 21 '11 at 18:01
  • 8
    While this is true, we're in an odd situation here. The primary request was "I'm trying to search for the word Gadaffi", but I feel the regex was a red herring. There is no rulebook on Arabic->latin transliteration, and as such reversing a regex from a list will not fully answer the original request. – tomwalsham Mar 21 '11 at 18:19
  • 2
    I feel a fuzzy-matching system is better suited, but a custom algorithm seems overkill. Using a soundex-metaphone combo seems to perform as well as the regex solution, allowing for further unanticipated spellings while still just using off-the-shelf algos. – tomwalsham Mar 21 '11 at 18:26
  • Use of metaphone2 and metaphone3 leads to better results (ie, almost everything in metaphone2 is KDF, where as metaphone1 isn't quite). Metaphone3 however, costs about 40 bucks. – Incognito Mar 21 '11 at 18:32
27

Using CPAN module Regexp::Assemble:

#!/usr/bin/env perl

use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add($_) for qw(Gadaffi Gadafi Gadafy Gaddafi Gaddafy
                    Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
                    Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi
                    Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
                    Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi
                    Qadthafi Qathafi Quathafi Qudhafi Kad'afi);
say $ra->re;

This produces the following regular expression:

(?-xism:(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi))
Prakash K
  • 3,020
  • 1
  • 20
  • 11
24

I think you're over complicating things here. The correct regex is as simple as:

\u0627\u0644\u0642\u0630\u0627\u0641\u064a

It matches the concatenation of the seven Arabic Unicode code points that forms the word القذافي (i.e. Gadaffi).

Staffan Nöteberg
  • 4,095
  • 1
  • 19
  • 17
19

If you want to avoid matching things that no-one has used (ie avoid tending towards ".+") your best approach would be to create a regular expression that's just all the alternatives (eg. (Qadafi|Kadafi|...)) then compile that to a DFA, and then convert the DFA back into a regular expression. Assuming a moderately sensible implementation that would give you a "compressed" regular expression that's guaranteed not to contain unexpected variants.

andrew cooke
  • 45,717
  • 10
  • 93
  • 143
  • 2
    I know that that is theretically possible, but how would you do that in practice (using for example som common dynamic language) – Amandasaurus Mar 21 '11 at 15:55
  • 3
    I understand the theory behind this, but like @Rory, I'm also interested to know how you'd actually do this in practice. – dancavallaro Mar 21 '11 at 16:54
  • yeah, i thought about doing it, to give a better answer, but i am a bit busy at the moment. i have some (ugly and poorly documented) code at http://code.google.com/p/lepl/source/browse/src/lepl/regexp/core.py that constructs a dfa from a regexp (actually, the parser is in another class, but the hard work is there; you go regexp -> nfa -> dfa). going from the dfa to a regexp is easy (i think?). – andrew cooke Mar 22 '11 at 00:29
  • actually, the documentation there is better than i remember :o) the basic idea is that you describe the regexp in terms of the classes near the top of the file. that then can be translated to an nfa fairly easily (an nfa is really just a set of transitions saying "if you get this letter than you can go here or here..." that's pretty easy to understand). the dfa is then a kind of "expanded" version of that where you keep avoid having to backtrack; that's done by NfaToDfa (and is the hard part). the dfa can then be though of as a regexp itself that's writen as very complex character sets(?!) – andrew cooke Mar 22 '11 at 00:41
10

If you've got a concrete listing of all 30 possibilities, just concatenate them all together with a bunch of "ors". Then you can be sure that it only matches the exact things you've listed, and no more. Your RE engine will probably be able to optimize in further, and, well, with 30 choices even if it doesn't it's still not a big deal. Trying to fiddle around with manually turning it into a "clever" RE can't possibly turn out better and may turn out worse.

9
(G|Gh|K|Kh|Q|Qh|Q|Qu)(a|au|e|u)(dh|zz|th|d|dd)(dh|th|a|ha|)(\x27|)(a|)(ff|f)(i|y)

Certainly not the most optimized version, split on syllables to maximize matches while trying to make sure we don't get false positives.

Sneaky
  • 91
  • 1
7

Well since you are matching small words why don't you try a similarity search engine with the Levenshtein distance? You can allow at most k insertions or deletions. This way you can change the distance function to other things that work better for your specific problem. There are many functions available in the simMetrics library.

4

A possible alternative is the online tool for generate regular expressions from examples http://regex.inginf.units.it. Give it a chance!

mimmuz
  • 339
  • 4
  • 9
1

I know this is an old question, but...

Neither of these two regexes is the prettiest, but they are optimized and both match ALL the variations in the original post.

"Little Beauty" #1

(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi)

"Little Beauty" #2

(?:(?:Gh|[GK])adaff|(?:(?:Gh|[GKQ])ad|(?:Ghe|(?:[GK]h|[GKQ])a)dd|(?:Gadd|(?:[GKQ]a|Q(?:adh|u))d|(?:Qad|(?:Qu|[GQ])a)t)h|Ka(?:zz|d'))af)i|(?:Khadaff|(?:(?:Kh|G)ad|Gh?add)af)y

Rest in Peace, Muammar.

zx81
  • 41,100
  • 9
  • 89
  • 105
1

Why not do a mixed approach? Something between a list of all possibilities and a complicated Regex that matches far too much.

Regex is about pattern matching and I can't see a pattern for all variants in the list. Trying to do so, will also find things like "Gazzafy" or "Quud'haffi" which are most probably not a used variant and definitly not on the list.

But I can see patterns for some of the variants, and so I ended up with this:

\b(?:Gheddafi|Gathafi|Kazzafi|Kad'afi|Qadhdhafi|Qadthafi|Qudhafi|Qu?athafi|[KG]h?add?h?aff?[iy]|Qad[dh]?afi)\b

At the beginning I list the ones where I can't see a pattern, then followed by some variants where there are patterns.

See it here on www.rubular.com

stema
  • 90,351
  • 20
  • 107
  • 135
0

Just an addendum: you should add "Gheddafi" as alternate spelling. So the RE should be

\b[KG]h?[ae]dd?af?fi$\b
Vito De Tullio
  • 2,332
  • 3
  • 33
  • 52
0

[GQK][ahu]+[dtez]+\'?[adhz]+f{1,2}(i|y)

In parts:

  • [GQK]
  • [ahu]+
  • [dtez]+
  • \'?
  • [adhz]+
  • f{1,2}(i|y)

Note: Just wanted to give a shot at this.

Dinko Pehar
  • 5,454
  • 4
  • 23
  • 57
-1

What else starts with Q, G, or K, has a d, z or t in the middle, and ends in "fi" the people actually search for?

/\b[GQK].+[dzt].+fi\b/i

Done.

>>> print re.search(a, "Gadasadasfiasdas") != None
False
>>> print re.search(a, "Gadasadasfi") != None
True
>>> print re.search(a, "Qa'dafi") != None
True

Interesting that I'm getting downvoted. Can someone leave some false positives in the comments?

Hank
  • 547
  • 3
  • 4
  • 2
    From a cracking dictionary that I happen to have sitting around: `kartografi kryptografi Gaddafi Qaddafi gadafi gaddafi katastloofi katastorfi katastrofi khadaffi kadafi kardiyografi gaskromatografi kardiografi kinematografi kromatografi krystallografi kulturgeografi gandolfi grizzaffi gadhafi kadaffi kaddafi khaddafi qaddafi qadhafi quedaffi gordonsCHsKFI `. Some of those aren't *false* positives, though. – BMDan Aug 27 '11 at 22:01
  • 2
    And the additions to that list that result from ending in `[iy]` instead of just `i`: `gelatinify gentrify ghostlify giddify gladify goutify gratify "Gyula Dessewffy" katasrofy katastrofy khadafy quantify quasi-deify quizzify ` – BMDan Aug 27 '11 at 22:04