
I have this regex.

$string =~ s/(?<!["\w])(\w+)(?=:)|(?<=:)([\w\d\\.+=\/]+)/"$1$2"/g;

The regex itself works fine.

But since I am substituting alternatives (and globally), I always get a warning that $1 or $2 is uninitialized. These warnings clutter my logfile.

What can I do to avoid these warnings? Or is my best option to just turn the warning off? I doubt it.

Side question: Is there possibly a better way of doing this, e.g. without using a regex at all? What I am doing is fixing JSON where some key:value pairs are missing their double quotes, which the JSON module rejects when I try to decode it.

okolnost
  • What if you replace `$1$2` with `$&`? BTW, your current pattern is invalid. – Wiktor Stribiżew Oct 27 '18 at 17:27
  • @WiktorStribiżew Yep, I just fixed the typo in regex, thanks for noticing. `$&` seems promising (I mean, definitely better than what I have now). Will using it produce warning if neither alternative is matched? If it does, any idea how to avoid that? – okolnost Oct 27 '18 at 17:35
  • `$&` is the whole-match placeholder, and if you use it, remove all capturing groups as they become redundant (`s/(?<!["\w])\w+(?=:)|(?<=:)[\w\\.+=\/]+/"$&"/g`). If you need to use `$1`, use `s/(?|(?<!["\w])(\w+)(?=:)|(?<=:)([\w\\.+=\/]+))/"$1"/g` – Wiktor Stribiżew Oct 27 '18 at 17:37
  • Since it's not actually a problem that $1 or $2 is undefined (the warning is unfortunately misworded) and results in the empty string, you can also disable that warning in the lexical scope where you run that regex. `{ no warnings 'uninitialized'; $string =~ ...; }` – Grinnz Oct 27 '18 at 21:12
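
The scoped approach from the last comment can be sketched as follows. This is only an illustration: the sample input and the `@warnings` collector are assumptions made up for the demo, not part of the original question.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them (demo helper, hypothetical)
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

my $string = '{name:john,age:42}';   # hypothetical sample input

{
    # Disable only the 'uninitialized' category, and only inside this block
    no warnings 'uninitialized';
    $string =~ s/(?<!["\w])(\w+)(?=:)|(?<=:)([\w\d\\.+=\/]+)/"$1$2"/g;
}

print "$string\n";                          # {"name":"john","age":"42"}
print scalar(@warnings), " warning(s)\n";   # 0 warning(s)
```

All other warning categories stay active, and the pragma stops applying at the closing brace.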

2 Answers


There are a couple of approaches to get around this.

If you intend to use capture groups:

  • When each branch of the alternation is captured in its entirety:
    combine the capture groups into one and move that group outside the alternation.

     (                             # (1 start)
          (?<! ["\w] )
          \w+ 
          (?= : )
       |  
          (?<= : )
          [\w\d\\.+=/]+ 
     )                             # (1 end)
    

    s/((?<!["\w])\w+(?=:)|(?<=:)[\w\d\\.+=\/]+)/"$1"/g

  • Use a Branch Reset construct (?| aaa ).
    This makes the capture groups in each branch of the alternation start numbering
    from the same point.

     (?|
          (?<! ["\w] )
          ( \w+ )                       # (1)
          (?= : )
       |  
          (?<= : )
          ( [\w\d\\.+=/]+ )             # (1)
     )
    

    s/(?|(?<!["\w])(\w+)(?=:)|(?<=:)([\w\d\\.+=\/]+))/"$1"/g

  • Use named capture groups that are reusable (similar to a branch reset).
    In each branch of the alternation, reuse the same names, and make the group that isn't relevant an empty group.
    This works by using the name in the substitution instead of the number.

        (?<! ["\w] )
        (?<V1> \w+ )                  # (1)
        (?<V2> )                      # (2)
        (?= : )
     |  
        (?<= : )
        (?<V1> )                      # (3)
        (?<V2> [\w\d\\.+=/]+ )        # (4)
    

    s/(?<!["\w])(?<V1>\w+)(?<V2>)(?=:)|(?<=:)(?<V1>)(?<V2>[\w\d\\.+=\/]+)/"$+{V1}$+{V2}"/g
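
As a quick check, the three rewrites above can be run side by side. This is a minimal sketch: the sample input and the `@warnings` collector are assumptions added for the demo, while the patterns are the ones from the list.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them (demo helper, hypothetical)
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

my $input = '{name:john,age:42}';   # hypothetical sample input

# 1. One capture group wrapped around the whole alternation
(my $single = $input)
    =~ s/((?<!["\w])\w+(?=:)|(?<=:)[\w\d\\.+=\/]+)/"$1"/g;

# 2. Branch reset: both branches capture into $1
(my $reset = $input)
    =~ s/(?|(?<!["\w])(\w+)(?=:)|(?<=:)([\w\d\\.+=\/]+))/"$1"/g;

# 3. Reused names, with an empty group for the part a branch does not need
(my $named = $input)
    =~ s/(?<!["\w])(?<V1>\w+)(?<V2>)(?=:)|(?<=:)(?<V1>)(?<V2>[\w\d\\.+=\/]+)/"$+{V1}$+{V2}"/g;

print "$_\n" for $single, $reset, $named;   # {"name":"john","age":"42"} three times
print scalar(@warnings), " warning(s)\n";   # 0 warning(s)
```

Note that the substitution quotes every bare value, numbers included, so `42` comes out as the string `"42"`.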


The empty-group idea from the named version and a branch reset can be combined
if a branch of the alternation contains more than 1 capture group.
The example below uses the capture group numbers.

The idea is to put dummy capture groups in each branch to
"pad" it up to the largest number of groups found in any single branch.

Indeed, this must be done to avoid the bug in Perl regex that could cause a crash.

 (?|                    # Branch Reset
                             # ------ Br 1 --------
      ( )                    # (1)
      ( \d{4} )              # (2)
      ABC294
      ( [a-f]+ )             # (3)
   |  
                             # ------ Br 2 --------          
      ( :: )                 # (1)
      ( \d+ )                # (2)
      ABC555
      ( )                    # (3)
   |  
                             # ------ Br 3 --------
      ( == )                 # (1)
      ( )                    # (2)
      ABC18888
      ( )                    # (3)
 )

s/(?|()(\d{4})ABC294([a-f]+)|(::)(\d+)ABC555()|(==)()ABC18888())/"$1$2$3"/g
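
To see what the padding buys you, here is a small sketch that runs the padded substitution next to an unpadded variant. The sample input, the unpadded pattern, and the `@warnings` collector are assumptions made up for this comparison.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them (demo helper, hypothetical)
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

my $input = '1234ABC294abc ::77ABC555 ==ABC18888';   # hypothetical sample input

# Padded: every branch defines groups 1..3, so $1$2$3 are always defined
(my $padded = $input)
    =~ s/(?|()(\d{4})ABC294([a-f]+)|(::)(\d+)ABC555()|(==)()ABC18888())/"$1$2$3"/g;
my $padded_warnings = @warnings;

# Unpadded: the third branch has no group 2, so $2 is undef when it matches
@warnings = ();
(my $unpadded = $input)
    =~ s/(?|(\d{4})ABC294([a-f]+)|(::)(\d+)ABC555|(==)ABC18888)/"$1$2"/g;
my $unpadded_warnings = @warnings;

print "$padded\n";   # "1234abc" "::77" "=="
print "padded: $padded_warnings warning(s), unpadded: $unpadded_warnings warning(s)\n";
```

Both versions produce the same text here; the difference is that only the unpadded one triggers the "uninitialized" warning.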


You can try using Cpanel::JSON::XS's relaxed mode, or JSONY, to parse the almost-JSON and then write out regular JSON using Cpanel::JSON::XS. Depending on what exactly is wrong with your input data, one or the other might understand it better.

use strict;
use warnings;
use Cpanel::JSON::XS 'encode_json';

# JSON is normally UTF-8 encoded; if you're reading it from a file, you will likely need to decode it from UTF-8
my $string = q<{foo: 1,bar:'baz',}>;

my $data = Cpanel::JSON::XS->new->relaxed->decode($string);
my $json = encode_json $data;
print "$json\n";

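# Alternative: parse the same input with JSONY instead.
# $data and $json are reused here; redeclaring them with "my" would trigger a warning.
use JSONY;
$data = JSONY->new->load($string);
$json = encode_json $data;
print "$json\n";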
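
If installing a CPAN module is not an option, something similar can be sketched with the core JSON::PP module. This is a swapped-in alternative, not what the answer above uses: `relaxed`, `allow_barekey` and `allow_singlequote` are JSON::PP options, and as far as I know they cover bare keys, single-quoted strings and trailing commas, but not bare string values.

```perl
use strict;
use warnings;
use JSON::PP;

my $string = q<{foo: 1,bar:'baz',}>;

# relaxed           => tolerate the trailing comma
# allow_barekey     => accept unquoted keys such as foo
# allow_singlequote => accept 'baz' as a string
my $data = JSON::PP->new
    ->relaxed
    ->allow_barekey
    ->allow_singlequote
    ->decode($string);

# canonical sorts the keys so the output order is stable
my $json = JSON::PP->new->canonical->encode($data);
print "$json\n";   # {"bar":"baz","foo":1}
```

Input with unquoted string values would still need one of the modules above or the regex fix.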
Grinnz