Search-and-replace regex with capture

Question

I have a long text file that I want to mostly remain the same, but certain phrases need to be translated. It's not exactly a clean search-and-replace... For example, I need to change each occurrence of this...

lis r3, ha16(aLabel)

...into this:

lis r3, aLabel@ha

I.e. I need to find the whole ha16(aLabel), capture the aLabel from it (which could be any identifier text up to the terminating end-paren), and then emit a replacement of the captured text followed by @ha.

I've found examples galore of perl search-and-replace, but I haven't come across anything quite like what I need, and other posts that mention 'perl' and 'capture' don't seem to address my problem... or maybe they do and I'm too stupid to realize it.

Is it always ha16, or could there be other patterns? Is it always 2 letters and 2 digits? — Prix, Aug 08 '13 at 23:26
@Prix - I'd like a general solution, but in this particular case I have two patterns I need to search, capture, and replace: `ha16(identifier)` --> `identifier@ha` and `lo16(identifier)` --> `identifier@l`. (No, that's not a typo, the 2nd conversion drops the 'o' in 'lo'.) The 2nd conversion can have characters following it on the same line that must be preserved, but the first does not. — phonetagger, Aug 08 '13 at 23:30

score 3 · Accepted Answer · edited May 23 '17 at 12:28

You could do it like this:

#!/usr/bin/perl

use strict;
use warnings;

my $text = 'lis r3, ha16(L_.str10) some more text blah lis r3, lo16(identifier) some more text blah lis r3, ot16(identifier)';
$text =~ s/(\w{2})\d{2}\(([\w\.]+)\)/$1 eq 'lo' ? $2 . '@l' : $2 . '@' . $1/gie;
print $text;

Can also be written as:

#!/usr/bin/perl

use strict;
use warnings;
while (<DATA>) {
     # /e evaluates the replacement as Perl code for each match
     s/(\w{2})\d{2}\(([\w\.]+)\)/$1 eq 'lo' ? $2 . '@l' : $2 . '@' . $1/gie;
     # print the line after the replacement
     print;
}

__DATA__
lis r3, ha16(L_.str10) 
some more text blah lis r3, lo16(identifier) 
some more text blah lis r3, ot16(identifier)

Put simply, the e modifier makes Perl evaluate the right-hand side of the substitution as code and use the result as the replacement. For a more detailed explanation you can read this question.

In this example, (\w{2})\d{2} matches the two-letter modifier and the two digits that precede the parenthesized label, capturing the two letters for later use. ([\w\.]+) matches your label: one or more alphanumeric characters, underscores, or dots.

On the right-hand side, a ternary operator chooses the suffix:

$1 eq 'lo' ? $2 . '@l' : $2 . '@' . $1

If the first capture (the two letters) equals lo, the suffix is @l; otherwise the two letters themselves become the suffix, for instance @ha or @ot in my sample text.
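
One detail worth noting: the pattern is matched with the i flag, but the comparison $1 eq 'lo' is case-sensitive, so an upper-case LO16(x) would come out as x@LO. A minimal sketch of one way around that, assuming you want lower-case suffixes regardless of input case (convert_line is a hypothetical helper name, not part of the answer):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rewrite every xx16(label) occurrence in one line.
sub convert_line {
    my ($line) = @_;
    # /e runs the replacement as Perl code for each match.
    # $1 is the two-letter prefix, $2 is the label.
    # lc() normalizes the prefix so the 'lo' test also works for 'LO'.
    $line =~ s/(\w{2})\d{2}\(([\w.]+)\)/$2 . '@' . (lc($1) eq 'lo' ? 'l' : lc($1))/gie;
    return $line;
}

print convert_line('lis r3, HA16(L_.str10)'), "\n";       # lis r3, L_.str10@ha
print convert_line('addi r3, r3, lo16(L_.str10)'), "\n";  # addi r3, r3, L_.str10@l
```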


answered Aug 08 '13 at 23:19

Prix


This is an awesome solution, thank you very much. My finished script is at: https://eval.in/41852 You've expanded my perl regex capabilities 4000% in only 40 minutes. – phonetagger Aug 09 '13 at 00:14
@phonetagger glad it worked for u, I've been playing with the `e` modifier its pretty badass. – Prix Aug 09 '13 at 00:15
@phonetagger by the way should this one also get fixed ? `-8(r1)` – Prix Aug 09 '13 at 00:16
No, the `ha16()`, `hi16()`, and `lo16()` modifiers are the LLVM assembler's dialect for what the GCC assembler uses: `@ha`, `@hi`, and `@l`. They specify the portion of the label/identifier's address that should go into the register, since in RISC code you can't load the entire 32 bits in a single instruction. So `@ha` and `@hi` load the top 16 bits, while `@l` loads the low 16 bits. The `-8(r1)` isn't a label/identifier specifying an address, it's just the contents of memory at offset -8 from register r1. – phonetagger Aug 09 '13 at 00:33
@phonetagger I see, thanks for letting me know, that looked familiar but never crossed my mind it could be that haha. – Prix Aug 09 '13 at 00:34
BTW, this is for PowerPC. To be perfectly correct, those are the LLVM assembler's dialect for PowerPC in Darwin mode. In Linux-Gnu mode, it outputs the GCC style modifiers, but I'm trying to compile for Darwin and then test the resulting assembly on my PPC Linux box. (Messed up, I know.) – phonetagger Aug 09 '13 at 00:42

score 2 · Answer 2 · 2013-08-09T00:07:08.910

I think this can be improved into one line, but this is how I would do it:

my $val = "lis r3, ha16(L_.str10)";
if ($val =~ /ha16\((.*?)\)/) {
    # $1 now contains the extracted text; copy it before the next match resets $1
    my $capture = $1;
    $val =~ s/ha16\(.*?\)/$capture\@ha/gi;
}

Explanation of the regex involved:

ha16\((.*?)\)

ha16\( matches the literal text "ha16(". The ( is escaped because it is a regex metacharacter.

(.*?) The () mean "capture everything that matches the pattern inside". .*? says "match zero or more (that's the *) of any character (that's the .)", and the ? makes the match non-greedy.

\) matches the literal closing parenthesis. Because of the non-greedy ? we used, the capture stops at the first ) it reaches.

And the replacement:

s/ha16\((.*?)\)/$1\@ha/gi

Anything in this format: s/<something>/<something>/ tells perl to do a find and replace. The $1 is the match from the first set of parentheses in the pattern (if there were more than one, we would have a $2 and so forth). The gi at the end says to replace GLOBALLY (don't stop after replacing the first match), and to match case-INSENSITIVELY.
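
For $1 to be filled in, the capturing parentheses have to be in the pattern of the same s/// that uses it. If the pattern has no capture group, $1 is undefined after the match and the identifier disappears from the output. A small sketch of both cases, using hypothetical helper names (with_group, without_group) purely for illustration:

```perl
use strict;
use warnings;

# Capture group inside the s/// pattern: $1 is set by this very match.
sub with_group {
    my ($s) = @_;
    $s =~ s/ha16\((.*?)\)/$1\@ha/gi;
    return $s;
}

# No capture group in the pattern: $1 is undef, so only "@ha" is emitted.
sub without_group {
    my ($s) = @_;
    no warnings 'uninitialized';   # silence the warning this mistake triggers
    $s =~ s/ha16\(.*?\)/$1\@ha/gi;
    return $s;
}

print with_group('lis r3, ha16(_globvar)'), "\n";     # lis r3, _globvar@ha
print without_group('lis r3, ha16(_globvar)'), "\n";  # lis r3, @ha
```

With the group in place, the whole conversion fits in the single substitution and no separate match step is needed.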

edited Aug 09 '13 at 00:07

answered Aug 08 '13 at 23:22

Why does this: https://eval.in/41801 not work? It wipes out the identifier "_globvar". – phonetagger Aug 08 '13 at 23:51
Aha... Apparently you can't use $1 inside of a regex because it's a regex metachar (http://stackoverflow.com/questions/3848221/how-do-i-use-a-perl-variable-in-a-regular-expression) I've updated the code to reflect this. – Aug 09 '13 at 00:06
You can just use `\1` – hwnd Aug 09 '13 at 00:16
@RobbertWijtman - Also a good solution, once fixed with your edit. – phonetagger Aug 09 '13 at 00:38

hwnd · Answer 3 · 2013-08-09T01:11:42.807

2

Something like..

use strict;
use warnings;

while (<>) {
     s/ha16\((.+)\)/$1\@ha/gi;
     print;
}

Or better yet, use a mapping to handle several variants at once.

my %map = (
    ha => '@ha',
    hi => '@hi',
    lo => '@l'
);

while (<>) {
   s/(\w{2})16\((.+)\)/$2$map{$1}/gi;
   print;
}

The . matches almost any character and + means one or more. The match is greedy by default; to switch off greediness, append ? to the quantifier (.+?).
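
Greediness matters when more text in parentheses follows on the same line. Assuming an operand such as lo16(_globvar)(r3), the greedy (.+) captures up to the last ) and yields _globvar)(r3@l. A hedged sketch of a stricter variant: it uses [^)]+ so the capture stops at the first ), and it lists the prefixes explicitly so every match has an entry in %map. The helper name translate is hypothetical, and the i flag is dropped here on the assumption that the input is lower-case:

```perl
use strict;
use warnings;

my %map = (
    ha => '@ha',
    hi => '@hi',
    lo => '@l',
);

# Hypothetical helper: translate one line of assembly.
sub translate {
    my ($line) = @_;
    # (ha|hi|lo) only matches prefixes that exist as keys in %map.
    # [^)]+ cannot cross a ')', so a trailing "(r3)" is left untouched.
    $line =~ s/\b(ha|hi|lo)16\(([^)]+)\)/$2$map{$1}/g;
    return $line;
}

print translate('lwz r4, lo16(_globvar)(r3)'), "\n";  # lwz r4, _globvar@l(r3)
print translate('lis r3, ha16(L_.str10)'), "\n";      # lis r3, L_.str10@ha
```

To run it over a file, the same substitution can go inside the while (<>) loop in place of the original one.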

edited Aug 09 '13 at 01:11

answered Aug 08 '13 at 23:32

hwnd

