Remove duplicates in a glossary source column while merging different meanings in target column

Question

I am stuck trying to find a regex for a task, and I am sure someone here has a great solution. I build glossaries, and sometimes I get repeated source terms that are exactly the same but have different targets. For example:

Absolute potential  الجهد المطلق
Absolute potential  جهد مطلق
Absolute potential  جهد مطلقفرق الجهد المطلق بين الفلز والمحلول
**Absolute power    سلطة استبدادية
Absolute power  سلطة مطلقة
Absolute power  قدرة مطلقة**
Absolute power consumption  استهلاك الطاقة الفعلي
Absolute pressure   الضغط المطلق
Absolute prices أسعار مطلقة
Absolute priority   أولوية مطلقة
Absolute priority   الأولوية المُطلقة
Absolute priority   اولوية / اسبقية

These are tab-delimited files. I want to find every repeated source term, such as "Absolute power", which is identical on all three lines, and replace the repeats with a single entry whose target meanings are merged and separated by a pipe character. The entry would then look like this:

**Absolute power    سلطة استبدادية | سلطة مطلقة | قدرة مطلقة**

So I am looking for a regex that does this automatically across the whole large glossary text file: a term, followed by a tab character, followed by the merged Arabic entries separated by pipe characters. That would really make my day. Sincerely, Sam

Yes, I can find duplicates with this regex: `^(.+?)\R(\1\R?)+`. The question is what to replace them with to get the desired result. Even this regex finds only some of the repeated source entries, not all of them! – Sam Mouha, Jan 16 '20 at 14:35
See this link for a screenshot of how it looks: http://www.atg2ftp.com/screen.jpg – Sam Mouha, Jan 16 '20 at 14:44

score 0 · Accepted Answer · answered Jan 16 '20 at 15:11

Here is an example of how you can do it with a Perl script, assuming the input file `glossary.txt` is UTF-8 encoded:

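use feature qw(say);
use strict;
use warnings;
# Decode input as UTF-8 and encode output as UTF-8; :std also applies this to STDIN/STDOUT/STDERR.
use open qw(:std IN :encoding(utf-8) OUT :utf8);

my $fn = 'glossary.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my %target;   # source term => array ref of its target meanings
my @order;    # source terms in first-seen order
while( my $line = <$fh> ) {
    chomp $line;
    # Split on the first tab only, so the rest of the line stays in one piece.
    my ($subject, $target) = split /\t/, $line, 2;
    next unless defined $target;   # skip blank lines and lines without a tab
    if (exists $target{$subject}) {
        push @{ $target{$subject} }, $target;
    }
    else {
        $target{$subject} = [ $target ];
        push @order, $subject;
    }
}
close $fh;

# Print one line per source term, with its targets joined by " | ".
for my $subject (@order) {
    say $subject . "\t" . join " | ", @{$target{$subject}};
}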
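
To check the grouping logic without touching a file, the same idea can be sketched on a few lines taken from the question. The helper name `merge_glossary` is hypothetical (it is not part of the script above), and the sample data is hard-coded purely for illustration:

```perl
use strict;
use warnings;
use utf8;                              # the sample strings below are UTF-8 source text
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: merge the targets of identical source terms,
# keeping source terms in the order they first appear.
sub merge_glossary {
    my @lines = @_;
    my (%targets, @order);
    for my $line (@lines) {
        my ($source, $target) = split /\t/, $line, 2;
        next unless defined $target;   # ignore lines without a tab
        push @order, $source unless exists $targets{$source};
        push @{ $targets{$source} }, $target;
    }
    return map { "$_\t" . join(' | ', @{ $targets{$_} }) } @order;
}

# Sample lines from the question (tab between source and target).
my @sample = (
    "Absolute power\tسلطة استبدادية",
    "Absolute power\tسلطة مطلقة",
    "Absolute power\tقدرة مطلقة",
    "Absolute pressure\tالضغط المطلق",
);
print "$_\n" for merge_glossary(@sample);
```

With this sample, the three "Absolute power" lines collapse into a single line with the targets joined by ` | `, and "Absolute pressure" passes through unchanged. Matching is exact, so source terms that differ only in case or trailing spaces are treated as different terms.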

answered Jan 16 '20 at 15:11

Håkon Hægland

Pardon my ignorance, but how do I run this script? Which software do I need? I am not a developer or programmer. Thanks a lot for assisting me. – Sam Mouha Jan 16 '20 at 17:25
@SamMouha Just save the script with a name, e.g. `fixup_glossary.pl`, then run it from the terminal with `perl fixup_glossary.pl` – Håkon Hægland Jan 16 '20 at 17:30
I managed to do it, but in the CMD prompt, after running the script, I got this strange encoded text Single entity tours ╪د┘╪▒╪ص┘╪د╪ز ╪د┘╪░╪د╪ز┘è╪ر ╪د┘┘à╪ز╪╣╪»╪»╪ر Single entry accounting ╪ص╪│╪د╪ذ╪د╪ز ╪ذ╪د┘┘é┘è╪» ╪د┘┘à┘┘╪▒╪» Single In Line Package ╪▒╪▓┘à╪ر ╪ث╪ص╪د╪»┘è╪ر ╪د┘╪ز╪▒╪د╪╡┘ single indicator method of deflation ╪د┘╪▒┘é┘à ╪د┘┘é┘è╪د╪│┘è ╪د┘┘à┘╪▒╪» (╪د┘╪ز╪╡╪ص┘è╪ص ╪د┘┘à┘┘╪▒╪») Single line ╪«╪╖ ┘à┘╪▒╪» – Sam Mouha Jan 16 '20 at 18:05
OK, I am on Linux. Maybe there are some encoding issues on Windows? I will try to test on a Windows machine. – Håkon Hægland Jan 16 '20 at 18:08
It worked, I just had to adjust the paths and redirect the output to a file: `perl d:\fixup_glossary.pl > d:\glossary_fixed.txt`. You are just great, Hakon, I really appreciate your help on this one. – Sam Mouha Jan 16 '20 at 18:53
Is there a reason you used the discouraged `:utf8` layer for output when already using the correct one for input? – Grinnz Jan 16 '20 at 22:14
@SamMouha If you are on Windows, you may need to set your terminal encoding to UTF-8 for the output to appear correctly there (the command is `chcp`; Google or Stack Overflow can find the correct invocation that you need). – Grinnz Jan 16 '20 at 22:17
@Grinnz *"Is there a reason you used the discouraged `:utf8` layer for output"* According to [perldoc perlunifaq](https://perldoc.perl.org/5.30.0/perlunifaq.html) using `:utf8` *"..is widely accepted as good behavior when you're writing, but it can be dangerous when reading..."* see also [this](https://stackoverflow.com/q/14566460/2173773) question. – Håkon Hægland Jan 16 '20 at 23:47
As always "widely accepted" is a time-sensitive and opinion-sensitive metric :) I consider any usage of it to be wrong unless it is dealing with the intended purpose of the `utf8` layer: the internal Perl string format. – Grinnz Jan 17 '20 at 01:21
There are of course [better options](https://metacpan.org/pod/PerlIO::utf8_strict) than `:encoding(UTF-8)`, but it is the best core option for the use case of translating between UTF-8 bytes and characters. – Grinnz Jan 17 '20 at 01:23
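
As a rough sketch of what the layer discussion above suggests, the script could use `:encoding(UTF-8)` in both directions and write straight to an output file, which also sidesteps the console encoding issue. The helper name `merge_glossary_file` and the temporary file names are assumptions for illustration, not part of the original answer:

```perl
use strict;
use warnings;
use utf8;                              # the demo strings below are UTF-8 source text
use File::Temp qw(tempdir);

# Hypothetical helper: read a tab-delimited glossary, merge targets per
# source term, and write the result using :encoding(UTF-8) for both handles.
sub merge_glossary_file {
    my ($in_fn, $out_fn) = @_;
    open my $in,  '<:encoding(UTF-8)', $in_fn  or die "Could not open '$in_fn': $!";
    open my $out, '>:encoding(UTF-8)', $out_fn or die "Could not open '$out_fn': $!";
    my (%targets, @order);
    while (my $line = <$in>) {
        chomp $line;
        my ($source, $target) = split /\t/, $line, 2;
        next unless defined $target;   # ignore lines without a tab
        push @order, $source unless exists $targets{$source};
        push @{ $targets{$source} }, $target;
    }
    close $in;
    print {$out} "$_\t", join(' | ', @{ $targets{$_} }), "\n" for @order;
    close $out or die "Could not close '$out_fn': $!";
    return scalar @order;              # number of merged entries written
}

# Demo on a throwaway file; real use would pass the actual glossary paths.
my $dir = tempdir(CLEANUP => 1);
my ($in_fn, $out_fn) = ("$dir/glossary.txt", "$dir/glossary_fixed.txt");
open my $fh, '>:encoding(UTF-8)', $in_fn or die "Could not create '$in_fn': $!";
print {$fh} "Absolute priority\tأولوية مطلقة\nAbsolute priority\tالأولوية المُطلقة\n";
close $fh;
my $count = merge_glossary_file($in_fn, $out_fn);
print "Wrote $count merged entries\n";
```

Because the result goes to a file rather than the terminal, it can be opened in any UTF-8 aware editor regardless of the console code page.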
