Split string using SUBSTR or SPLIT?

Question

I'm at a loss and hoping to find help here. What I'm trying to accomplish is the following: I have a .csv file with 8 columns. The third column contains phone numbers formatted like so:

+45 23455678
+45 12314425
+45 43631678
+45 12345678
(goes on for a while)

What I want is:

+45 2345 5678
+45 1231 4425
+45 4363 1678
+45 1234 5678
(etc)

So just a whitespace after the 8th position (including the + and the whitespace). I've tried various things but it's not working. First I tried it with substr but couldn't get it to work. Then I looked at the split function. And then I got confused! I'm new to Perl so I'm not sure what I'm looking for, but I've tried everything. There's one condition: all the numbers begin with (let's say) +45, then a whitespace and a block of digits. But not all the numbers have the same length; some have more than 10 digits. What I want it to do is take the first bit "+45 1234" (/+43\s{1}\d{4}/) and then the second part, no matter how many digits it has. I figured setting LIMIT to 1 so it just adds the last bit, no matter if it's 4 digits or 8 long.

I've read http://www.perlmonks.org/?node_id=591988, but the part "Using split versus Regular Expressions" got me confused.

I've been trying for 3 days now and not getting anywhere. I guess it should be simple, but I'm only just getting to know the basics of Perl. I do have an understanding of regular expressions, but I don't know which statement to use for a certain task. This is my code:

@ARGV or die "Usage: $0  input-file output-file\n";

$inputfile=$ARGV[0];
$outputfile=$ARGV[1];

open(INFILE,$inputfile) || die "Bestand niet gevonden :$!\n";
open(OUTFILE,">$outputfile") || die "Bestand niet gevonden :$!\n";

$i = 0;

@infile=<INFILE>;

foreach ( @infile ) {
    $infile[$i] =~ s/"//g;                            
    @elements = split(/;/,$infile[$i]);         

    @split = split(/\+43\s{1}\d{4}/, $elements[2], 1);

    @split = join ???

    @elements = join(";",@elements);            # Add ';' to all elements
    print OUTFILE "@elements";
    $i = $i+1;
}

close(INFILE);
close(OUTFILE);

`use strict;` and `use warnings;`! [Use lexical file handlers and use open(3)](http://stackoverflow.com/a/616524/367180). — matthias krull, Jun 19 '12 at 10:53
Split is not appropriate for adding the space, you have nothing to split on. Perl programmers will naturally go for a match/replace (see answers). You could substring it: $new = substr($old, 0, 8) . " " . substr($old, 8); should do it. — Bill Ruppert, Jun 19 '12 at 11:43
I do realise my code is far from perfect. As I mentioned before, I'm still learning most functions. Thank you all for taking a look at my problem! I'm working on improving my code and myself right now ;) I find some aspects of Perl to be very difficult! I still have a lot to learn, I see. — Jan, Jun 19 '12 at 12:17

flesk · Answer 1 · 2012-06-19T11:24:42.903

There are several issues with your code, but to address your question on how to add a space after the 8th position in a string, I'm going to assume you have stored your phone numbers in an array @phone_numbers. This is a task well suited for a regex:

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my @phone_numbers = (
    '+45 23455678',
    '+45 12314425',
    '+45 43631678',
    '+45 12345678'
);

s/^(.{8})/$1 / for @phone_numbers;

print Dumper \@phone_numbers;

Output:

$VAR1 = [
      '+45 2345 5678',
      '+45 1231 4425',
      '+45 4363 1678',
      '+45 1234 5678'
    ];

To apply the pattern to your script, just add:

$elements[2] =~ s/^(.{8})/$1 /;

or alternatively

my @chars = split//, $elements[2];
splice @chars, 8, 0, ' ';
$elements[2] = join"", @chars;

to alter phone numbers within your foreach loop.

IMO those are clumsy compared to `substr $elements[2], 8, 0, ' '` — Borodin, Jun 19 '12 at 15:05
@Borodin: I can't argue with that. I got so caught up in split versus regex that I completely forgot about `substr`. — flesk, Jun 19 '12 at 18:57

Bill Ruppert · Accepted Answer · 2012-06-19T12:45:36.080

Here is a more idiomatic version of your program.

use strict;
use warnings;

my $inputfile  = shift || die "Need input and output file names!\n";
my $outputfile = shift || die "Need an output file name!\n";

open my $INFILE,  '<', $inputfile   or die "Bestand niet gevonden :$!\n";
open my $OUTFILE, '>', $outputfile  or die "Bestand niet gevonden :$!\n";

my $i = 0;

while (<$INFILE>) {
    # print; # for debugging
    s/"//g;
    my @elements = split /;/, $_;
    # print join "%", @elements;  # debugging aid: uncomment to inspect the fields
    $elements[2] =~ s/^(.{8})/$1 /;
    my $output_line = join(";", @elements);
    print $OUTFILE $output_line;
    $i = $i+1;
}

close $INFILE;
close $OUTFILE;

exit 0;

Oops, left in a debugging print after the while. — Bill Ruppert, Jun 19 '12 at 12:44

score 0 · Answer 3 · answered Jun 19 '12 at 11:54


Use substr as an lvalue, that is, on the left-hand side of an assignment:

use strict;
use warnings;

while (<DATA>) {
    my @elements = split /;/, $_;
    substr($elements[2], 8, 0) = ' ';
    print join(";", @elements);
}

__DATA__
col1;col2;+45 23455678
col1;col2;+45 12314425
col1;col2;+45 43631678
col1;col2;+45 12345678

output:

col1;col2;+45 2345 5678
col1;col2;+45 1231 4425
col1;col2;+45 4363 1678
col1;col2;+45 1234 5678
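
For comparison, here is a minimal sketch of the two spellings of the same insertion: substr as an lvalue, as in the loop above, and the four-argument form of substr. The sub names insert_space_lvalue and insert_space_4arg are made up for this illustration and do not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: insert a space at offset 8 by assigning to substr (lvalue form).
sub insert_space_lvalue {
    my ($number) = @_;
    substr($number, 8, 0) = ' ';
    return $number;
}

# Hypothetical helper: the same insertion with the four-argument form of substr.
sub insert_space_4arg {
    my ($number) = @_;
    substr $number, 8, 0, ' ';
    return $number;
}

# Both forms split after the 8th character, however long the rest is.
for my $number ('+45 23455678', '+45 123144250') {
    print insert_space_lvalue($number), "\n";
    print insert_space_4arg($number), "\n";
}
```

Both helpers work on a copy of their argument, so the caller's string is left untouched; inside a loop over @elements you would normally modify the field in place instead.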

answered Jun 19 '12 at 11:54

Toto


Equivalent to, but noisier than `substr $elements[2], 8, 0, ' '` – Borodin Jun 19 '12 at 16:52

score 0 · Answer 4 · 2012-06-25T08:24:42.803


A Perl one-liner, which you can also use on multiple .csv files:

perl -0777 -i -F/;/ -a -pe "s/(\+45\s\d{4})(\d+.*?)/$1 $2/ for @F;$_=join ';',@F;" s_infile.csv

edited Jun 25 '12 at 08:24

answered Jun 19 '12 at 12:07


You can use `-F/;/ -a` and `@F`, and use `-p` and say `$_ = join ';', @F` skipping the print. – TLP Jun 19 '12 at 12:17

score 0 · Answer 5 · answered Jun 19 '12 at 12:13

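This is the basic gist of how it's done. The "prefix" to the numeric string is \+45, which is hard-coded, and you may change it as needed. \pN matches a numeric character, {4} means exactly 4, and \K keeps everything matched so far out of the replacement, so only the space is inserted.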

use strict;
use warnings;

while (<DATA>) {
    s/^\+45 \pN{4}\K/ /;
    print;
}

__DATA__
+45 234556780
+45 12314425
+45 436316781
+45 12345678
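
As a side note, \K requires Perl 5.10 or later. The sketch below, which assumes the same hard-coded +45 prefix, puts the \K form next to an equivalent capture-group form that also runs on older perls. The sub names space_with_keep and space_with_capture are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper using \K (Perl 5.10+): everything matched before \K is
# kept as-is, and the empty rest of the match is replaced by a space.
sub space_with_keep {
    my ($number) = @_;
    $number =~ s/^\+45 \pN{4}\K/ /;
    return $number;
}

# Hypothetical helper for older perls: capture the prefix and put it back.
sub space_with_capture {
    my ($number) = @_;
    $number =~ s/^(\+45 \pN{4})/$1 /;
    return $number;
}

# The last sample has a different prefix and is therefore left unchanged.
for my $number ('+45 234556780', '+45 12314425', '+46 12345678') {
    printf "%-14s => %s\n", $number, space_with_keep($number);
}
```

Because the pattern only fixes the first four digits, the remainder can be of any length: '+45 234556780' becomes '+45 2345 56780'.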

Your code has numerous other problems:

You do not use use strict; use warnings;. This is a huge mistake. It's like riding a motorcycle and protecting your head by putting on a blindfold instead of a helmet. Often, it is an easy piece of advice to overlook, because it is explained very briefly, so I am being more verbose than I have to in order to make a point: This is the most important thing wrong. If you miss all the rest of your errors, it's better than if you miss this part.

Your open statements are two-argument, and you do not verify your arguments in any way. This is very dangerous, because it allows people to perform arbitrary commands. Use the three-argument open with a lexical file handle and explicit MODE for open:

open my $in, "<", $inputfile or die $!;
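
A slightly fuller sketch of the same idea, assuming you want a readable error message as well. The helper name open_for_reading is made up, and the temporary file from the core module File::Temp only exists to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: three-argument open with an explicit read mode.
# The name is taken literally, so characters such as '|' or '>' in it
# are never interpreted as a mode or as a command to run.
sub open_for_reading {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "Cannot open '$path' for reading: $!\n";
    return $fh;
}

# Demonstration with a throwaway file.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "a;b;+45 12345678\n";
close $tmp_fh or die "Cannot close '$tmp_path': $!\n";

my $in = open_for_reading($tmp_path);
my $first_line = <$in>;
close $in;
print $first_line;
```

With the two-argument form, a "file name" such as 'echo oops |' would be run as a command; here it is simply a file that does not exist, and the helper dies with an error.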

You slurp the file into an array: @infile=<INFILE>. The idiomatic way to read a file is:

while (<$in>) {  # read line by line
    ...
}

What's even worse, you loop with foreach (@infile), but refer to $infile[$i] and keep a variable counting upwards in the loop. This is mixing two styles of loops, and even though it "works", it certainly looks bad. Looping over an array is done either:

for my $line ( @infile ) {  # foreach style
    $line =~ s/"//g;
    ...
}

for my $index ( 0 .. $#infile ) { # array index style
    $infile[$index] =~ ....
}

But neither of these two loops is what you should use, since the while loop above is much preferred. Also, you don't actually have to open the files yourself at all. The *nix way is to supply your input file name or STDIN, and redirect STDOUT if needed:

perl script.pl inputfile > outputfile

or, using STDIN

some_command | perl script.pl > outputfile

To achieve this, just remove all open commands and use

while (<>) {  # diamond operator, open STDIN or ARGV as needed
    ...
}
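
Filled in for this question, such a filter could look roughly like the sketch below. The helper name reformat_line is invented, and the temporary file plus the assignment to @ARGV are only there so the example runs on its own; in a real script you would drop them, pass the file name on the command line, and print each line directly.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: strip quotes and add the space to the third field.
sub reformat_line {
    my ($line) = @_;
    $line =~ tr/"//d;
    my @elements = split /;/, $line, -1;
    $elements[2] =~ s/^(\+\d{2} \d{4})/$1 / if defined $elements[2];
    return join ';', @elements;
}

# For a self-contained demo only: write sample input to a temp file and
# pretend its name was given on the command line.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} qq{"x";"y";"+45 23455678";"z"\n};
close $tmp_fh or die "Cannot close '$tmp_path': $!\n";
@ARGV = ($tmp_path);

my @output;
while (my $line = <>) {    # reads the files named in @ARGV, or STDIN if there are none
    push @output, reformat_line($line);
}
print @output;
```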

However, in this case, since you are using CSV data, you should be using a CSV module to parse your file:

use strict;
use warnings;
use ARGV::readonly;  # safer usage of @ARGV file reading

use Text::CSV;

my $csv = Text::CSV->new({
        sep_char    => ";",
        eol     => $/,
        binary      => 1,
        });

while (my $row = $csv->getline(*DATA)) {  # read input line by line
    if (defined $row->[1]) {              # don't process empty rows
        $row->[1] =~ s/^\+45 *\pN{4}\K/ /;
    }
    $csv->print(*STDOUT, $row);
}

__DATA__
fooo;+45 234556780;bar
1231;+45 12314425;
oh captain, my captain;+45 436316781;zssdasd
"foo;bar;baz";+45 12345678;barbarbar

In the above script, you can replace the DATA file handle (which uses inline data) with ARGV, which will use all script arguments as input file names. For this purpose, I added ARGV::readonly, which will force your script to only open files in a safe way.

As you can see, my sample script contains quoted semi-colons, something split would be hard pressed to handle. The specific print statement will enforce some CSV rules on your output, such as adding quotes. See the documentation for more info.

Thank you for taking the time to explain it all, it's actually quite clear. This is very helpful to me! Sorry for all the faulty coding. I am learning from code that was already written, so that would explain a few things. It all makes a lot of sense and I'm trying to improve my code. — Jan, Jun 19 '12 at 14:38

score 0 · Answer 6 · answered Jun 19 '12 at 14:59

To add a space after the eighth character of a string you can use the fourth parameter of substr.

substr $string, 8, 0, ' ';

replaces a zero-length substring starting at offset 8 with a single space.

You may think it's safer to use regular expressions so that only data in the expected format is changed

$string =~ s/^(\+\d{2} \d{4})/$1 /;

or

$string =~ s/^\+\d{2} \d{4}\K/ /;

will achieve the same thing, but will do nothing if the number doesn't look as it should beforehand.
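
To make the difference concrete, here is a small sketch that runs both approaches over one well-formed number and two fields that are not in the expected format. The sub names by_position and by_pattern and the sample values are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: positional insert, applied whatever the field contains.
sub by_position {
    my ($field) = @_;
    substr $field, 8, 0, ' ' if length $field >= 8;    # guard: offset 8 must lie within the string
    return $field;
}

# Hypothetical helper: insert only when the field looks like "+NN NNNN...".
sub by_pattern {
    my ($field) = @_;
    $field =~ s/^(\+\d{2} \d{4})/$1 /;
    return $field;
}

# Columns: original field, positional result, pattern-based result.
for my $field ('+45 12345678', 'Phone number', '0045 12345678') {
    printf "%-14s | %-15s | %s\n", $field, by_position($field), by_pattern($field);
}
```

The positional version happily turns a header such as 'Phone number' into 'Phone nu mber', while the pattern version leaves anything that does not match alone.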

Here is a reworking of your program. Most importantly, you should use strict and use warnings at the start of your program, and declare variables with my at the point of their first use. Also use the three-parameter form of open and lexical filehandles. Lastly, it is best to avoid reading an entire file into an array when a while loop will let you process it a line at a time.

use strict;
use warnings;

@ARGV == 2 or die "Usage: $0 input-file output-file\n";

my ($inputfile, $outputfile) = @ARGV;

open my $in, '<', $inputfile or die "Bestand niet gevonden: $!";
open my $out, '>', $outputfile or die "Bestand niet gevonden: $!";

while (<$in>) {
  tr/"//d;                            
  my @elements = split /;/;
  substr $elements[2], 8, 0, ' ';
  print $out join ';', @elements;
}
