3

I would like to append a string in perl within a loop in a fast way, without having to copy the string for each iteration. I'm looking for something like StringBuilder from Java or C#.

I currently know the following alternatives in mind, in order to do 'a += b'.

  1. a .= b # concat
  2. a = join('', a, b); # join
  3. push @a, b # array push

I am not interested in copying all string to the other. I need to copy one character per time, or append small strings foreach iteration. I am trying to solve the following problem: compress the input string 'aaabbccc' to '3a2b3c'. So the idea is to iterate over the input string, check how many repeated characters we have, and then append to the output in the compressed way. What is the most efficient to perform this in perl ?

Here is a link to the problem I was trying to solve. I's slightly different though.

André Pontes
  • 447
  • 1
  • 4
  • 13

2 Answers2

4

For comparsion, I tried to test different versions for solving your actual problem of compressing the string. Here is my test script test.pl:

use strict;
use warnings;

use Benchmark qw(cmpthese);
use Inline C => './compress_c.c';

my $str_len = 10000;
my @chars = qw(a b c d);
my $str;
$str .= [@chars]->[rand 4] for 1 .. $str_len;

cmpthese(
    -1,
    {
        compress_array => sub { compress_array( $str ) },
        compress_regex => sub { compress_regex( $str ) },
        compress_str   => sub { compress_str( $str ) },
        compress_c     => sub { compress_c( $str ) },
    }
);

# Suggested by @melpomene in the comments   
sub compress_regex {
    return $_[0] =~ s/([a-z])\1+/($+[0] - $-[0]) . $1/egr;
}

sub compress_array {
    my $result = '';

    my @chrs = split //, $_[0];

    my $prev = $chrs[0];
    my $count = 1;
    my @result;
    for my $i ( 1..$#chrs ) {
        my $char = $chrs[$i];
        if ( $prev eq $char ) {
            $count++;
            next if $i < $#chrs;
        }
        if ( $count > 1) {
            push @result, $count, $prev;
        }
        else {
            push @result, $prev;
        }
        if ( ( $i == $#chrs ) and ( $prev ne $char ) ) {
            push @result, $char;
            last;
        }
        $count = 1;
        $prev = $char;
    }

    return join '', @result;
}

sub compress_str {
    my $result = '';
    my $prev = substr $_[0], 0, 1;
    my $count = 1;
    my $lastind = (length $_[0]) - 1;
    for my $i (1 .. $lastind) {
        my $char = substr $_[0], $i, 1;
        if ( $prev eq $char ) {
            $count++;
            next if $i < $lastind;
        }

        if ( $count > 1) {
            $result .= $count;
        }
        $result .= $prev;
        if ( ( $i == $lastind ) and ( $prev ne $char ) ) {
            $result .= $char;
            last;
        }
        $count = 1;
        $prev = $char;
    }

    return $result;
}

where compress_c.c is:

SV *compress_c(SV* str_sv) {
    STRLEN len;
    char* str = SvPVbyte(str_sv, len);

    SV* result = newSV(len);
    char *buf = SvPVX(result);

    char prev = str[0];
    int count = 1;
    int j = 0;
    int i;
    for (i = 1; i < len; i++ )
    {
    char cur = str[i];
        if ( prev == cur ) {
            count++;
            if ( i < (len - 1) )
                continue;
        }

        if ( count > 1) {
            buf[j++] = count + '0';  // assume count is less than 10
        }

        buf[j++] = prev;
        if ( (i == (len - 1)) && (prev != cur) ) buf[j++] = cur;
        count = 1;
        prev = cur;
    }

    buf[j] = '\0';
    SvPOK_on(result);
    SvCUR_set(result, j);
    return result;
}

The result of running perl test.pl:

                  Rate compress_array  compress_str compress_regex    compress_c
compress_array   311/s             --          -42%           -45%          -99%
compress_str     533/s            71%            --            -6%          -98%
compress_regex   570/s            83%            7%             --          -98%
compress_c     30632/s          9746%         5644%          5273%            --

Which shows that regex version is slightly faster than the string version. However, the C version is the fastest, and it is about 50 times as fast as the regex version.

Note: I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU @ 2.70GHz)

ikegami
  • 367,544
  • 15
  • 269
  • 518
Håkon Hægland
  • 39,012
  • 21
  • 81
  • 174
  • @ikegami Thanks for the edits and the comments! I was wondering how to get rid of the `malloc` call so I did not have to allocate the buffer twice. I have some questions: 1. Is `SvGETMAGIC(str_sv)` necessary when we are using `SvPVbyte(str_sv, len)`? According to [`perlapi`](http://perldoc.perl.org/perlapi.html) `SvPV()` handles get magic. 2. Why the braced block after `SvGETMAGIC()`? I cannot see any reason to add a local scope here. 3. What do you mean by *"Removed Unicode bug"* (comment in the answer revision history)? – Håkon Hægland Feb 20 '17 at 08:33
  • ... 4. Also the comment *"Fixed a compile error (`int` inside of `for`)"*. It should be valid to declare variables in the `for` loop specification in C99, see [this answer](http://stackoverflow.com/a/1287867/2173773). – Håkon Hægland Feb 20 '17 at 08:33
  • 1
    I've [commited here](https://github.com/andrepontesmelo/CtCI-6th-Edition-Perl/tree/master/Chapter1/6_String_Compression) the benchmarks for the original problem that I was trying to solve. It's a slightly different problem. I've got same results though. – André Pontes Feb 20 '17 at 10:39
  • 1
    1. Oops, you're right about `SvPVbyte` already doing `SvGETMAGIC`. 2. Depending on the version of C, you can't mix declarations and code. Var declarations can only be found at the start of a block or after another var declaration. 3. I (effectively) replaced SvPV with SvPVbytes. You can't use SvPV without separately checking the `UTF8` flag. 4. Switching `gcc` to C99 mode requires a switch that wasn't used when my `perl` was built. – ikegami Feb 20 '17 at 17:46
2

I've performed the following benchmark in several ways to perform that:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $dna;
$dna .= [qw(G A T C)]->[rand 4] for 1 .. 10000;

sub frequency_concat {
    my $result = '';

    for my $idx (0 .. length($dna) - 1) {
            $result .= substr($dna, $idx, 1);
    }

    return $result;
 }

 sub frequency_join {
    my $result = '';

    for my $idx (0 .. length($dna) - 1) {
            $result = join '', $result, substr($dna,$idx,1);
    }

    return $result;
}

sub frequency_list_push {
       my @result = ();

       for my $idx (0 .. length($dna) - 1) {
               push @result, substr($dna,$idx,1);
       }

       return join '', @result;
 }

 sub frequency_list_prealloc {
            my @result = (' ' x length($dna));

            for my $idx (0 .. length($dna) - 1) {
                    $result[$idx] = substr($dna,$idx,1);
            }

            return join '', @result;
 }


cmpthese(-1, # Run each for at least 1 second(s)   {
               concat => \&frequency_concat,
               join => \&frequency_join,
               list_push => \&frequency_list_push,
               list_list_prealloc => \&frequency_list_prealloc
       }
   );

The results below have shown that the concat (a . b) is the fastest operation. I don't understand why, since this will need to make several copies of the string.

                    Rate         join   list_push list_list_prealloc          concat
join               213/s           --        -38%               -41%        -74%
list_push          342/s          60%          --                -5%        -58%
list_list_prealloc 359/s          68%          5%                 --        -56%
concat             822/s         285%        140%               129%          --
André Pontes
  • 447
  • 1
  • 4
  • 13
  • 3
    Your "list prealloc" case doesn't preallocate an array. It allocates a single element, which is a long string. – melpomene Feb 19 '17 at 20:29
  • 1
    IMHO, would be better benchmark the **whole** problem - e.g, counting the number of occurences of a character in a string. – clt60 Feb 19 '17 at 20:42
  • I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU @ 2.70GHz), and got similar results. I removed the call to `substr` and replaced it with a constant 1 char string instead. Now, `concat` was the fastest with 3229/s. I also implemented a C version with preallocation (using [`Inline::C`](https://metacpan.org/pod/distribution/Inline-C/lib/Inline/C.pod)) that was aproximately 90 times faster than the `concat` version (284350/s). The reason I removed the `substr` call was to be able to more easily compare the C version with the Perl versions. – Håkon Hægland Feb 19 '17 at 21:37
  • melpomene: Now I see. ' ' x length($dna) is one string. I've fixed it with (' ') x length($dna), which would prealloc the array. Same order of benchmark results. – André Pontes Feb 19 '17 at 22:19