
I have several old text data files that were generated back in the '90s using an old DOS-era word processor. Due to the limitations of the time, many, many entries were "simplified" during data input.

For example, the word "Náufragos" was entered as "Naufragos".

Now, when searching said data files for "Náufragos" with grep, the search comes up empty (as expected, since only the unaccented form was stored), but I really need that search to find and output "Naufragos".

I've combed the grep documentation and have Googled extensively, but have come up empty.

Any solution needs to handle most (if not all) character "variations" based on the Latin alphabet (i.e. there are no Chinese, Cyrillic, Japanese, etc. characters present in said old data files).

Is there a grep or, perhaps, perl option that does this? Perhaps something like:

grep -<magic option> Náufragos file.txt
Digger
  • Take a look at this Perl module: http://search.cpan.org/~bkb/Text-Fuzzy-0.24/lib/Text/Fuzzy.pod. It can compare the words and return their "similarity index". For your sample word, the index should equal "1", as a single character is changed (a short sketch follows these comments). – bart Apr 10 '16 at 18:35
  • 1
    http://stackoverflow.com/q/11058211/1030675 – choroba Apr 10 '16 at 18:43
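A minimal sketch of the Text::Fuzzy approach suggested in the first comment, run the same way as the scripts below (e.g. perl -CS fuzzy.pl file.txt). The script name, the word splitting, and the edit-distance threshold of 1 are illustrative assumptions, not part of the question:

#!/usr/bin/perl
# Sketch of the Text::Fuzzy suggestion above: print every input line that
# contains a word within one edit of the search term.
use strict;
use warnings;
use utf8;
use Text::Fuzzy;

my $tf = Text::Fuzzy->new('Náufragos');

while (<>) {
    for my $word (split /\s+/) {
        if ($tf->distance($word) <= 1) {    # "Naufragos" is 1 edit away
            print;
            last;
        }
    }
}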

2 Answers


To ignore diacritics, you can search using the Unicode Collation Algorithm at strength level 1, which compares only base letters.

#!/usr/bin/perl

use strict;
use warnings;
use Unicode::Collate;

# level => 1 compares only primary weights (base letters), so diacritics and
# case are ignored; normalization => undef skips Unicode normalization of the
# input strings.
my $collator = Unicode::Collate->new(level => 1, normalization => undef);

while (<>) {
    print if $collator->match($_, "Naufragos");
}

Saving this script as ucagrep.pl and testing it:

$ echo -e "Náufragos\nNaufragos\nÑaufragos" | perl -CS ucagrep.pl 
Náufragos
Naufragos
Ñaufragos

Oops: with the default collation, "Ñ" has the same primary weight as "N", so at level 1 "Ñaufragos" matches too. We'd better specify the locale:

#!/usr/bin/perl

use strict;
use warnings;
use Unicode::Collate::Locale;

# The Spanish ("es") tailoring gives "ñ" its own primary weight, so it no
# longer collapses into "n" at level 1, while ordinary accents are still
# ignored.
my $collator = Unicode::Collate::Locale->new(locale => "es", level => 1, normalization => undef);

while (<>) {
    print if $collator->match($_, "Naufragos");
}

Testing it:

$ echo -e "Náufragos\nNaufragos\nÑaufragos" | perl -CS ucagrep.pl 
Náufragos
Naufragos

Much better.
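To make this behave more like the grep invocation in the question, the pattern can be taken from the command line instead of being hard-coded. This is a sketch under that assumption (the -CSA switch decodes @ARGV as UTF-8; the argument handling is not part of the original answer):

#!/usr/bin/perl
# First argument is the search term, remaining arguments (or stdin) are files:
#   perl -CSA ucagrep.pl Náufragos file.txt
use strict;
use warnings;
use Unicode::Collate::Locale;

my $pattern  = shift @ARGV;
my $collator = Unicode::Collate::Locale->new(locale => "es", level => 1,
                                             normalization => undef);

while (<>) {
    print if $collator->match($_, $pattern);
}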

ninjalj

You can always grep using ranges of characters, e.g.,

grep -i 'N[aá]ufragos' *

to match either spelling of the name. If that becomes a nuisance, a script using Text::Unidecode, as discussed in How to convert letters with accents, umlauts, etc to their ASCII counterparts in Perl?, could generate the range expressions (since you are likely dealing only with the few dozen characters in ISO-8859-1 which have diacritical marks); a sketch of that idea follows.
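As a sketch of that idea (the character range, the helper name, and the output handling are assumptions, not from the answer), the bracket expressions could be generated by grouping every accented ISO-8859-1 letter with the ASCII letter Text::Unidecode maps it to:

#!/usr/bin/perl
# Sketch: build a grep pattern such as "N[NÑñ]a[aÀÁÂÃÄÅàáâãäå]ufr..."
# from an ASCII search term, by grouping every Latin-1 letter with the
# base letter Text::Unidecode transliterates it to.
use strict;
use warnings;
use Text::Unidecode;

my %variants;
for my $code (0xC0 .. 0xFF) {                 # accented block of ISO-8859-1
    my $ch   = chr($code);
    my $base = lc unidecode($ch);
    push @{ $variants{$base} }, $ch if $base =~ /^[a-z]$/;
}

sub to_pattern {
    my ($word) = @_;
    return join '', map {
        my $base = lc $_;
        $variants{$base} ? '[' . $_ . join('', @{ $variants{$base} }) . ']' : $_;
    } split //, $word;
}

binmode STDOUT, ':encoding(UTF-8)';
print to_pattern('Naufragos'), "\n";          # use with: grep -i -e PATTERN files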

One drawback to Text::Unidecode is that it is unlikely to be preinstalled on a system (and I see for example no package in Debian). You would get that directly from CPAN, e.g., using cpanminus.
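For example, assuming cpanminus is already installed, the module can be pulled in with:

$ cpanm Text::Unidecode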

Here is a simple example just searching for the old names (cpanminus puts the package in a non-standard location):

#!/usr/bin/perl -w

use strict;
use lib '/usr/local/lib/perl';    # where cpanminus installed Text::Unidecode
use Text::Unidecode;

# Transliterate each command-line argument to its ASCII approximation,
# e.g. "Náufragos" becomes "Naufragos".
my @args = unidecode(@ARGV);

for my $n ( 0 .. $#args ) {
    my $name = $args[$n];
    printf "** grep %s -> %s\n", $ARGV[$n], $name;
    # list form of system avoids shell-quoting problems with odd characters
    system("grep", "-r", "--", $name, ".");
}

However, a better script would match both the old and new spellings, since it is easy to overlook files that were converted. Whether to ignore case is also something to consider. A sketch of that variation follows.
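Reusing the loop above, one way to do it (the -i/-e flags and the choice to search for both spellings in a single grep call are assumptions, not part of the original answer):

for my $n ( 0 .. $#args ) {
    my $old = $args[$n];    # ASCII-folded spelling, e.g. "Naufragos"
    my $new = $ARGV[$n];    # spelling as given, e.g. "Náufragos"
    printf "** grep %s / %s\n", $new, $old;
    # -i ignores case; each -e supplies one pattern, so both spellings match
    system("grep", "-r", "-i", "-e", $old, "-e", $new, ".");
}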

Thomas Dickey