I have a long text file that should mostly remain the same, but certain phrases need to be translated. It's not exactly a clean search-and-replace. For example, I need to change each occurrence of this...

lis r3, ha16(aLabel)

...into this:

lis r3, aLabel@ha

I.e. I need to find the whole ha16(aLabel), capture the aLabel from it (which could be any identifier text up to the terminating end-paren), and then emit a replacement of the captured text followed by @ha.

I've found examples galore of perl search-and-replace, but I haven't come across anything quite like what I need, and other posts that mention 'perl' and 'capture' don't seem to address my problem... or maybe they do and I'm too stupid to realize it.

Prix
phonetagger
  • Is it always `ha16`, or could there be other patterns? Is it always 2 letters and 2 digits? – Prix Aug 08 '13 at 23:26
  • @Prix - I'd like a general solution, but in this particular case I have two patterns I need to search, capture, and replace: `ha16(identifier)` --> `identifier@ha` and `lo16(identifier)` --> `identifier@l`. (No, that's not a typo, the 2nd conversion drops the 'o' in 'lo'.) The 2nd conversion can have characters following it on the same line that must be preserved, but the first does not. – phonetagger Aug 08 '13 at 23:30

3 Answers

3

You could do it like this:

#!/usr/bin/perl

use strict;
use warnings;

my $text = 'lis r3, ha16(L_.str10) some more text blah lis r3, lo16(identifier) some more text blah lis r3, ot16(identifier)';
$text =~ s/(\w{2})\d{2}\(([\w\.]+)\)/$1 eq 'lo' ? $2 . '@l' : $2 . '@' . $1/gie;
print $text;

Can also be written as:

#!/usr/bin/perl

use strict;
use warnings;
while (<DATA>) {
     # /e evaluates the replacement side as Perl code for each match
     s/(\w{2})\d{2}\(([\w\.]+)\)/$1 eq 'lo' ? $2 . '@l' : $2 . '@' . $1/gie;
     # print the line after the replacement
     print;
}

__DATA__
lis r3, ha16(L_.str10) 
some more text blah lis r3, lo16(identifier) 
some more text blah lis r3, ot16(identifier)

Put simply, the e modifier lets you put code on the right-hand side of the substitution; that code is evaluated for each match and its result is used as the replacement. For a more detailed explanation you can read this question.
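
As a minimal sketch of what the e modifier changes (the `double_numbers` helper and the sample text are made up for illustration), compare the same substitution with and without it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: double every integer found in a string.
sub double_numbers {
    my ($text) = @_;
    # With /e the replacement side is run as Perl code for each match,
    # and the value of that expression becomes the replacement text.
    $text =~ s/(\d+)/$1 * 2/ge;
    return $text;
}

my $sample = 'add 16 then 8';
print double_numbers($sample), "\n";   # add 32 then 16

# Without /e the replacement is an ordinary interpolated string.
(my $plain = $sample) =~ s/(\d+)/$1 * 2/g;
print $plain, "\n";                    # add 16 * 2 then 8 * 2
```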

In this example, (\w{2})\d{2} matches the modifier name in front of the parenthesized label: it captures two word characters (the ha or lo) for later use and then matches the two digits. ([\w\.]+) captures the label itself, which may consist of letters, digits, underscores, and dots.
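
The same pattern can be spelled out piece by piece with the x modifier, which permits whitespace and comments inside a regex. This is only an annotated sketch; the names `$modifier_re` and `parse_modifier` are hypothetical and not part of the answer's script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the answer, laid out with /x for readability.
my $modifier_re = qr{
    (\w{2})     # group 1: two word characters, e.g. "ha" or "lo"
    \d{2}       # two digits, e.g. "16"
    \(          # a literal opening parenthesis
    ([\w.]+)    # group 2: the label (letters, digits, underscores, dots)
    \)          # a literal closing parenthesis
}xi;

# Hypothetical helper: return (modifier, label) for the first match,
# or an empty list when the text contains no such construct.
sub parse_modifier {
    my ($text) = @_;
    return $text =~ $modifier_re ? ($1, $2) : ();
}

my ($modifier, $label) = parse_modifier('lis r3, ha16(L_.str10)');
print "modifier=$modifier label=$label\n";
```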

On the right-hand side I use the ternary operator to choose the extension:

$1 eq 'lo' ? $2 . '@l' : $2 . '@' . $1

If the first capture (the two letters) equals lo, the label gets @l; otherwise the two letters themselves become the extension, for instance @ha or @ot in my sample text.
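
One caveat worth sketching: the pattern is matched with the i modifier, but eq is case-sensitive, so an upper-case LO16(...) would match and then fall through to the else branch. Assuming upper-case input can occur at all (the question does not say), a variant that lower-cases the captured letters first could look like this; `convert_modifiers` is a hypothetical wrapper name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the answer's substitution that
# normalizes the captured letters before comparing them.
sub convert_modifiers {
    my ($line) = @_;
    # lc($1) makes the comparison and the emitted suffix lower-case,
    # so "LO16(x)" becomes "x@l" rather than "x@LO".
    $line =~ s/(\w{2})\d{2}\(([\w.]+)\)/lc($1) eq 'lo' ? $2 . '@l' : $2 . '@' . lc($1)/gie;
    return $line;
}

print convert_modifiers('lis r3, HA16(L_.str10)'), "\n";
print convert_modifiers('addi r3, r3, LO16(identifier)'), "\n";
```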

Live DEMO.

Community
Prix
  • This is an awesome solution, thank you very much. My finished script is at: https://eval.in/41852 You've expanded my perl regex capabilities 4000% in only 40 minutes. – phonetagger Aug 09 '13 at 00:14
  • @phonetagger glad it worked for u, I've been playing with the `e` modifier its pretty badass. – Prix Aug 09 '13 at 00:15
  • @phonetagger by the way should this one also get fixed ? `-8(r1)` – Prix Aug 09 '13 at 00:16
  • No, the `ha16()`, `hi16()`, and `lo16()` modifiers are the LLVM assembler's dialect for what the GCC assembler uses: `@ha`, `@hi`, and `@l`. They specify the portion of the label/identifier's address that should go into the register, since in RISC code you can't load the entire 32 bits in a single instruction. So `@ha` and `@hi` load the top 16 bits, while `@l` loads the low 16 bits. The `-8(r1)` isn't a label/identifier specifying an address, it's just the contents of memory at offset -8 from register r1. – phonetagger Aug 09 '13 at 00:33
  • @phonetagger I see, thanks for letting me know, that looked familiar but never crossed my mind it could be that haha. – Prix Aug 09 '13 at 00:34
  • BTW, this is for PowerPC. To be perfectly correct, those are the LLVM assembler's dialect for PowerPC in Darwin mode. In Linux-Gnu mode, it outputs the GCC style modifiers, but I'm trying to compile for Darwin and then test the resulting assembly on my PPC Linux box. (Messed up, I know.) – phonetagger Aug 09 '13 at 00:42
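
To illustrate the difference between the three modifiers described in the comments above, here is a small arithmetic sketch. The `split_address` helper and the address 0x1234ABCD are invented for the example; the point is that `@ha` is the high half adjusted by one when bit 15 of the address is set, which compensates for the sign extension of the low half when it is later added back:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a 32-bit address into the three halves
# that the @ha, @hi and @l modifiers select.
sub split_address {
    my ($addr) = @_;
    my $lo = $addr & 0xFFFF;                      # @l:  low 16 bits
    my $hi = ($addr >> 16) & 0xFFFF;              # @hi: plain high 16 bits
    my $ha = (($addr + 0x8000) >> 16) & 0xFFFF;   # @ha: high 16 bits, adjusted
    return ($ha, $hi, $lo);
}

# Made-up address whose low half has bit 15 set, so @ha and @hi differ.
printf "ha=0x%04X hi=0x%04X lo=0x%04X\n", split_address(0x1234ABCD);
# prints: ha=0x1235 hi=0x1234 lo=0xABCD
```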
2

I think this can be improved into one line, but this is how I would do it:

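my $val = "lis r3, ha16(L_.str10)";
if ($val =~ /ha16\((.*?)\)/) {
    # $1 now contains the extracted text; copy it, because the
    # substitution below has no capture group of its own
    my $capture = $1;
    # note: with /g, every ha16(...) in $val receives this one label
    $val =~ s/ha16\(.*?\)/$capture\@ha/gi;
}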

Explanation of the regex involved:

ha16\((.*?)\)

ha16\( matches the literal text "ha16(". The ( is escaped because an unescaped ( is a regex metacharacter.

(.*?) The () mean "capture whatever matches the pattern inside". The . matches any character (except a newline), the * means zero or more of them, and the ? makes the match non-greedy, so it takes as few characters as possible.

\) matches a literal closing parenthesis. Because of the non-greedy ?, the capture stops at the first ) it reaches.

And the replacement:

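s/ha16\(.*?\)/$capture\@ha/gi

Anything in the form s/<pattern>/<replacement>/ tells perl to do a find and replace. $capture holds what the first set of parentheses matched in the earlier test, copied from $1 (a second set of parentheses would have set $2, and so forth). The substitution's own pattern has no capturing parentheses, which is why $1 cannot be used in its replacement directly. The @ is escaped so that perl does not try to interpolate an array named @ha. The gi at the end says to replace globally (don't stop after the first match) and to match case-insensitively.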
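
For comparison, a one-line form is possible if the capturing parentheses are placed in the substitution's own pattern, so that $1 is set afresh for every match. This is a sketch rather than the code from the answer, and `ha16_to_gcc` is a hypothetical name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rewrite every ha16(label) in a line as label@ha.
sub ha16_to_gcc {
    my ($line) = @_;
    # The capture group is part of the s/// pattern, so $1 refers to
    # the label of the match currently being replaced.
    $line =~ s/ha16\((.*?)\)/$1\@ha/gi;
    return $line;
}

print ha16_to_gcc('lis r3, ha16(L_.str10)'), "\n";
```

Because $1 is re-captured per match, two different labels on one line each keep their own name.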

  • Why does this: https://eval.in/41801 not work? It wipes out the identifier "_globvar". – phonetagger Aug 08 '13 at 23:51
  • Aha... Apparently you can't use $1 inside of a regex because it's a regex metachar (http://stackoverflow.com/questions/3848221/how-do-i-use-a-perl-variable-in-a-regular-expression) I've updated the code to reflect this. –  Aug 09 '13 at 00:06
  • You can just use `\1` – hwnd Aug 09 '13 at 00:16
  • @RobbertWijtman - Also a good solution, once fixed with your edit. – phonetagger Aug 09 '13 at 00:38
2

Something like..

use strict;
use warnings;

while (<>) {
     s/ha16\((.+?)\)/$1\@ha/gi;
     print;
}

Or better yet, use a hash that maps each prefix to its replacement suffix, so that one substitution handles all the variations.

my %map = (
    ha => '@ha',
    hi => '@hi',
    lo => '@l'
);

while (<>) {
   # match only prefixes that exist in %map; lc keeps the lookup valid under /i
   s/(ha|hi|lo)16\((.+?)\)/$2$map{lc $1}/gi;
   print;
}

Switch off greediness by adding ? after the quantifier (.+?). The . matches any character except a newline, and + means one or more.
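
To see what greediness changes in practice, here is a small comparison. The input line `lwz r4, lo16(aLabel)(r3)` is an assumed example of a lo16 use with trailing text on the same line, and both helper names are hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Greedy: (.+) runs to the LAST closing parenthesis on the line.
sub greedy_convert {
    my ($line) = @_;
    $line =~ s/lo16\((.+)\)/$1\@l/gi;
    return $line;
}

# Non-greedy: (.+?) stops at the FIRST closing parenthesis.
sub non_greedy_convert {
    my ($line) = @_;
    $line =~ s/lo16\((.+?)\)/$1\@l/gi;
    return $line;
}

my $line = 'lwz r4, lo16(aLabel)(r3)';
print greedy_convert($line), "\n";       # lwz r4, aLabel)(r3@l
print non_greedy_convert($line), "\n";   # lwz r4, aLabel@l(r3)
```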

hwnd