regex match repeated line-initial strings and delete repetitions

Question

I'm a newbie with regex and am pretty sure this question has been answered somewhere, but I haven't succeeded in tweaking what I've found to do the job. I'm working with a dictionary file with repeated headwords, which cause the compiler to fail. So I need to match exact headwords (none of which contain characters such as "[" and "<") at the beginning of a line and delete the repetitions. There are many, many duplicate headwords across the file, so I would like to replace matches automatically. Here's an example from the dictionary:

aGga
<© aGga @>
[m1]aṅgá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim aṅga,[/b] how much more?[/trn][/m]

aGga
<© aGga @>
[m1]áṅga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

Here I would need to match the identical headwords ("aGga") and then delete the second, third, etc., instances (the second "aGga") as well as the line following each of them (which happens to be enclosed in < and >: "<© aGga @>"), producing this desired output:

aGga
<© aGga @>
[m1]aṅgá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim aṅga,[/b] how much more?[/trn][/m]
[m1]áṅga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

I've seen 3 instances of a headword, so I need to look for more than just one repetition of any given headword.

My attempts so far (such as "^(.+?\s)" based on this question) just at matching identical headwords are returning too much. I'm mostly using the regex find and replace function in Sublime Text, but would be happy to do this in any way possible. I know this is probably really simple and boring for regex gurus, so thanks for taking the time to help a newbie.

Could you post an example output? I can't quite figure out what you're trying to do. Do you want to delete the complete second entry? Or just the heading? — Patrick J. S., Sep 29 '14 at 22:58
This is probably trivial in Perl. Are identical headwords sequential with no breaks or are they interlaced with others ? — , Sep 30 '14 at 00:04
Repeated headwords are sequential in the sense that their multiline text blocks follow one after the other. But, as above, several lines will always intervene between repeated headwords (5 lines in the example above). — camatkara, Sep 30 '14 at 00:08

Casimir et Hippolyte · Answer 1 · 2014-09-30T01:49:05.253

A way with perl:

my $data = 'aGga
<© aGga @>
[m1]aṅgá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim aṅga,[/b] how much more?

aGga
<© aGga @>
[m1]áṅga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i]

aGga
<© aGga @>
[m2][trn][i]m. pl. No of a people and their country.[/i]

gubo
<© gubo @>
kjhkjhkj hkjhk jhk kjhkjh khk hkjh kj hkj';
$data =~ s/^
(?|
    \G(?!\A) ([^[<\s]+) \R <©\ \1\ @>  # contiguous
  |
    ([^[<\s]+) \R <©\ \1\ @> \K        # new item
)
( (?>\R.+)* )      # block: group 2
(?: \R\R (?= \1 \R <©[^>]+@> $ ) )?
/$2/gmx;
print $data;

Thanks for your answer -- I just accepted the one above, which was the first to work for me, though. — camatkara, Sep 30 '14 at 01:39

score 3 · Accepted Answer · 2014-09-30T01:29:28.067

edit: Some open/close stuff for utf8

# Open a temp file for writing as utf8
# Output to this file will be automatically encoded from Perl internal to utf8 octets
# Write the internal string
# Check the file with a utf8 editor
# ---------------------------------------------- 
open (my $out, '>:utf8', 'temp.txt') or die "can't open temp.txt for writing $!";
print $out $internal_string_1;
close $out;


# Open the temp file for reading as utf8
# All input from this file will be automatically decoded as utf8 octets to Perl internal
# Read/decode to a different internal string
# ----------------------------------------------
open (my $in, '<:utf8', 'temp.txt') or die "can't open temp.txt for reading $!";
$/ = undef;
my $internal_string_2 = <$in>;
close $in;
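
The same write-then-read round trip can be checked end to end. This is only a sketch under a few assumptions: it uses the stricter `:encoding(UTF-8)` layer instead of `:utf8`, the file name `roundtrip_demo.txt` is made up for illustration, and `local $/` is used so slurp mode is limited to one block rather than set globally.

```perl
use strict;
use warnings;
use utf8;                          # this source contains literal UTF-8 text

my $file = 'roundtrip_demo.txt';   # hypothetical file name, illustration only
my $text = "aGga\n<© aGga \@>\n[m1]aṅgá [/m]\n";

# Write: characters are encoded to UTF-8 bytes on output.
open(my $out, '>:encoding(UTF-8)', $file)
    or die "can't open $file for writing: $!";
print {$out} $text;
close($out) or die "can't close $file: $!";

# Read: bytes are decoded back to characters on input.
my $copy = do {
    open(my $in, '<:encoding(UTF-8)', $file)
        or die "can't open $file for reading: $!";
    local $/;                      # slurp mode, restored when the block exits
    <$in>;
};

# If both layers agree, the decoded copy equals the original string.
print $copy eq $text ? "round trip ok\n" : "round trip FAILED\n";
unlink $file;
```

If the comparison fails, the usual suspect is a layer missing on one side, which is also what turns characters such as ṅ into "?" or mojibake.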

Sorry took so long.
This is one way: it uses a global substitution with a callback.
For this to work, the blocks must be sequential.

If the blocks aren't sequential, the solution would have to be expanded.

 # /((?<=^)\s*)^([^<\[\n]+?)(\s*\n\s*<.*>.*(?:\n|$))/

 (                             # (1 start), Ws trim
      (?<= ^ )
      \s* 
 )                             # (1 end)
 ^                             # BOL
 ( [^<\[\n]+? )                # (2), Head
 (                             # (3 start), Angle head
      \s* \n \s* < .* > .* 
      (?: \n | $ )                  # Newline or EOL
 )                             # (3 end)
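
To see what the three groups pick up, the one-line form of the pattern can be run against a single entry. A minimal sketch, assuming a made-up ASCII entry (`head` / `<tag head>`) so no encoding setup is needed:

```perl
use strict;
use warnings;

# Made-up sample entry: a leading blank line, headword, angle line, body.
my $entry = "\nhead\n<tag head>\n[m1]body[/m]\n";

# In list context the match returns the three captured groups.
my ($wstrim, $head, $angle_head) =
    $entry =~ /((?<=^)\s*)^([^<\[\n]+?)(\s*\n\s*<.*>.*(?:\n|$))/m
    or die "no match";

# Make the newlines in group 3 visible when printing.
(my $shown = $angle_head) =~ s/\n/\\n/g;

printf "group 1 (ws trim):    %d char(s)\n", length $wstrim;  # the leading newline
print  "group 2 (head):       '$head'\n";
print  "group 3 (angle head): '$shown'\n";
```

Group 3 includes the newline before and after the `<...>` line, which is why returning an empty string from the callback removes the duplicate headword, its angle line, and the blank line in front of them in one go.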

Perl sample:

use strict;
use warnings;

$/ = undef;
#open(my $filehandle, ..) or die $!;   # open() returns true/false, not the handle
#my $data = <$filehandle>;

my $data = <DATA>;


my $lasthead = "";


sub StripDupHead
{
   my ($wstrim, $head, $angle_head ) = @_;
   if ( $head eq $lasthead ) {
      return "";
   }
   $lasthead = $head;
   return $wstrim . $head . $angle_head;
}

$data =~ s/((?<=^)\s*)^([^<\[\r\n]+?)(\s*\r?\n\s*<.*>.*(?:\r?\n|$))/StripDupHead($1,$2,$3)/emg;

print $data, "\n";
# print $filehandle $data, "\n";
# close ($filehandle);

__DATA__

aGga
<© aGga @>
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]

aGga
<© aGga @>
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

bGgb
<© bGgb @>
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]

cGgc
<© cGgc @>
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

cGgc
<© cGgc @>
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]

cGgc
<© cGgc @>
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

Output:

aGga
<© aGga @>
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

bGgb
<© bGgb @>
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]

cGgc
<© cGgc @>
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]
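
If duplicate headwords are not adjacent, comparing against `$lasthead` is no longer enough. One possible expansion is to group entries by headword instead. The following is a rough sketch, not part of the tested script above, and it rests on assumptions: entries are separated by blank lines, the first two lines of each entry are the headword and its `<...>` line, and later duplicates should be appended to the first occurrence. `merge_entries` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: merge entries that share a headword, keeping each
# headword at the position of its first occurrence.
sub merge_entries {
    my ($text) = @_;
    $text =~ s/\A\s+//;                       # ignore leading blank lines
    my (@order, %header_for, %body_for);

    # Entries are assumed to be separated by one or more blank lines.
    for my $entry ( grep { /\S/ } split /(?:\r?\n){2,}/, $text ) {
        my ($head, $angle, @body) = split /\r?\n/, $entry;
        if ( !exists $header_for{$head} ) {
            push @order, $head;               # remember first-seen order
            $header_for{$head} = "$head\n$angle\n";
        }
        # Append this entry's body lines to whatever was collected before.
        $body_for{$head} .= join '', map { "$_\n" } @body;
    }
    # Rebuild the file with one blank line between merged entries.
    return join "\n", map { $header_for{$_} . $body_for{$_} } @order;
}

my $sample = "a\n<a>\nfirst a\n\nb\n<b>\nonly b\n\na\n<a>\nsecond a\n";
print merge_entries($sample);
```

With this sample, the second `a` entry is folded into the first one even though `b` sits between them. The trade-off is that the whole file is held in memory as a hash of strings, which is normally fine for a dictionary source file but is worth keeping in mind for very large inputs.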

Thanks. Since my file has non-ascii characters (ṅ in line 3), do we have to have: open(my $fh, '>:encoding(UTF-8)', $filename) or die "Could not open file '$filename'"; in there so they don't get replaced with "?" (a?ga instead of aṅga)? I'll give it a whirl. — camatkara, Sep 30 '14 at 01:13
Yep, you gotta use something like that. There is a whole thing about BOM's, encode(), decode(), encoding(), etc ... to determine what the file is (unless you already know). Either way, Perl uses a byte code internally for speed. Regex wise, it will handle it just fine. — , Sep 30 '14 at 01:21
I added some file open/close utf8 sample stuff from my archives. I can't remember if they work, think they do. — , Sep 30 '14 at 01:30
actually, I ran the script as-is and it worked like a charm. Thank you so much! Saved me a ton of time... Interestingly, when I put the regex in Sublime Text, it captured way too much. — camatkara, Sep 30 '14 at 01:36
Could be the default is dot-all in Sublime. Replacing all the dots `.` with `[^\r\n]` (any character except a line break) might fix that. — , Sep 30 '14 at 01:44
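
The over-matching suspected in the last comment can be reproduced in Perl by switching on the `/s` flag, which lets `.` match newlines. A small sketch with a made-up two-entry string; whether Sublime Text actually behaves this way by default is not confirmed here:

```perl
use strict;
use warnings;

# Made-up two-entry sample.
my $text = "x\n<x>\nbody\n\ny\n<y>\nbody\n";

my ($plain)    = $text =~ /(<.*>)/;         # '.' stops at the end of the line
my ($dotall)   = $text =~ /(<.*>)/s;        # '.' also crosses newlines
my ($explicit) = $text =~ /(<[^\r\n]*>)/s;  # line-bound even in dot-all mode

for my $pair ( [ plain => $plain ], [ dotall => $dotall ], [ explicit => $explicit ] ) {
    (my $shown = $pair->[1]) =~ s/\n/\\n/g; # make newlines visible
    print "$pair->[0]: $shown\n";
}
```

In dot-all mode the greedy `<.*>` runs from the first `<` to the last `>` in the whole string, while the explicit class `[^\r\n]` keeps the match on one line regardless of the flag.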
