2
   title: Football team: Real Madrid stadium: Santiago Bernabeu players: Zinédine Zidane, Ronaldo, Luís Figo, Roberto Carlos, Raúl personnel: José Mourinho (head coach) Aitor Karanka (assistant coach (es))

How can I split this with Perl into:

   title: Football
   team: Real Madrid
   stadium: Santiago Bernabeu
   players: Zinédine Zidane Ronaldo Luís Figo Roberto Carlos Raúl
   personnel: José Mourinho (head coach) Aitor Karanka (assistant coach (es))
Axeman
user935420

5 Answers

7

Use a lookahead assertion:

use feature 'say';
say for split /(?=\b\w+:)/, $real_madrid_string;

Output

title: Football
team: Real Madrid
stadium: Santiago Bernabeu
players: Zinédine Zidane Ronaldo Luís Figo Roberto Carlos Raúl
personnel: José Mourinho (head coach) Aitor Karanka (assistant coach (es))
Zaid
  • If "players" is translated into another language, e.g. "players" = "jucător", the zero-width lookahead stops at the character "ă" instead of the colon ":" (a word character that is interpreted as a non-word character) and splits there. Thanks. – user935420 Sep 12 '11 at 10:53
  • Then your Perl version is not new enough to directly support this. You can try to split on `\P{Letter}` instead, but I guess you will also need to wiggle with Perl options to get it into a UTF-8 kind of mood, maybe with Perl `-CSD`. Maybe one or the other will suffice. – tripleee Sep 12 '11 at 11:32
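A minimal sketch of the Unicode point raised in these comments, assuming the script itself is saved as UTF-8. With `use utf8` the string literal is decoded into characters, so `\w` and `\b` treat "ă" as a word character and the split no longer breaks inside "jucător". The sample string is made up for illustration; only the word "jucător" comes from the comment above.

```perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8 text
binmode STDOUT, ':encoding(UTF-8)';    # encode characters again on output

# Hypothetical sample line mixing an ASCII key with a non-ASCII one.
my $string = 'team: Real Madrid jucător: Raúl stadium: Santiago Bernabeu';

# Split in front of every "word:" key; \b keeps the key itself in one piece.
my @fields = split /(?=\b\w+:)/, $string;
s/\s+$// for @fields;                  # drop the trailing space of each field

print "$_\n" for @fields;
```

If the text comes from a file or from standard input instead of a literal, it has to be decoded there as well, for example with an `:encoding(UTF-8)` layer on the filehandle or the `-CSD` switch mentioned above.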
2

This should do it. line.txt contains "title: Football team: Real Madrid stadium: Santiago Bernabeu players: Zinédine Zidane, Ronaldo, Luís Figo, Roberto Carlos, Raúl personnel: José Mourinho (head coach) Aitor Karanka (assistant coach (es))"

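#!/usr/bin/perl
use strict;
use warnings;

my $fn = "./line.txt";

# Three-argument open with a lexical filehandle; stop if the file is missing.
open(my $in, '<', $fn) or die "Cannot open $fn: $!";
my @lines = <$in>;
close($in);

my %hash;
my @order;      # keys in the order they first appear
my $hashKey;

foreach my $line (@lines) {
    chomp $line;
    foreach my $split (split ' ', $line) {
        if ($split =~ m/^(.+):$/) {
            # A word ending in a colon starts a new field; store it without the colon.
            $hashKey = $1;
            unless (exists $hash{$hashKey}) {
                push @order, $hashKey;
                $hash{$hashKey} = '';
            }
        } elsif (defined $hashKey) {
            # Append the word to the current field, separated by a single space.
            $hash{$hashKey} .= length($hash{$hashKey}) ? " $split" : $split;
        }
    }
}

foreach my $key (@order) {
    print "$key: $hash{$key}\n";
}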
Eamorr
1

Contrary to what many of the answers say, you do not need a lookahead (other than the regex engine's own); you only need to capture part of the delimiter, like so:

my @hash_fields = grep { length; } split /\s*(\w+):\s*/;

My full solution below:

my %handlers
    = ( players   => sub { return [ grep { length; } split /\s*,\s*/, shift ]; }
      , personnel => sub { 
            my $value = shift;
            my %personnel;
            # Using recursive regex for nested parens
            while ( $value =~ m/([^(]*)([(](?:[^()]+|(?2))*[)])/g ) {
                my ( $name, $role ) = ( $1, $2 );
                $role =~ s/^\s*[(]\s*//;
                $role =~ s/\s*[)]\s*$//;
                $name =~ s/^\s+//;
                $name =~ s/\s+$//;
                $personnel{ $role } = $name;
            }
            return \%personnel;
        }
      );
my %hash = grep { length; } split /(?:^|\s+)(\w+):\s+/, <DATA>;
foreach my $field ( keys %handlers ) { 
    $hash{ $field } = $handlers{ $field }->( $hash{ $field } );
}

Dump looks like this:

%hash: {
     personnel => {
                    'assistant coach (es)' => 'Aitor Karanka',
                    'head coach' => 'José Mourinho'
                  },
     players => [
                  'Zinédine Zidane',
                  'Ronaldo',
                  'Luís Figo',
                  'Roberto Carlos',
                  'Raúl'
                ],
     stadium => 'Santiago Bernabeu',
     team => 'Real Madrid',
     title => 'Football'
   }
Axeman
  • $value =~ m/([^(]*)([(](?:[^()]+|(?2))*[)])/g Undefined (?...) sequence. – user935420 Sep 12 '11 at 17:13
  • @user935420, don't know what problem you're having with it. In my strawberry perl 5.12 and ActivePerl 5.14, it works without a hitch. – Axeman Sep 12 '11 at 18:03
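One possible explanation for the "Undefined (?...) sequence" error, offered as an assumption since the commenter's Perl version is not stated: recursive subpatterns such as `(?2)` were introduced in Perl 5.10, and older interpreters reject them. A small sketch that states the requirement explicitly and isolates the nested-parentheses match (here the recursion targets group 1, so it is written `(?1)`):

```perl
use 5.010;    # recursive subpatterns such as (?1) need Perl 5.10 or later
use strict;
use warnings;

my $value = 'José Mourinho (head coach) Aitor Karanka (assistant coach (es))';

my @roles;
# Group 1 matches one balanced parenthesised chunk; (?1) recurses into group 1
# whenever a nested opening parenthesis is met.
while ( $value =~ /(\((?:[^()]+|(?1))*\))/g ) {
    push @roles, $1;
}

print "$_\n" for @roles;
```

On an interpreter older than 5.10 the `use 5.010;` line fails with a clear version message instead of a regex compilation error.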
0

The simplest way is to use split with a zero-width lookahead:

$string = "title: Football team: Real Madrid stadium: Santiago Bernabeu players: Zinédine Zidane, Ronaldo, Luís Figo, Roberto Carlos, Raúl personnel: José Mourinho (head coach) Aitor Karanka (assistant coach (es))";

@split_string = split /(?=\b\w+:)/, $string;
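The `\b` in that pattern matters. A short sketch comparing the pattern with and without the word boundary, on a shortened sample string chosen here for illustration: without `\b` the lookahead also succeeds in the middle of a key, so split cuts the key into single characters.

```perl
use strict;
use warnings;

my $s = 'team: Real Madrid stadium: Santiago Bernabeu';

# With \b the lookahead only succeeds at the start of a key.
my @with_boundary = split /(?=\b\w+:)/, $s;

# Without \b it also succeeds before "eam:", "am:", "m:" and so on.
my @without_boundary = split /(?=\w+:)/, $s;

print scalar(@with_boundary), " fields with \\b:\n";
print "  [$_]\n" for @with_boundary;
print scalar(@without_boundary), " fields without \\b:\n";
print "  [$_]\n" for @without_boundary;
```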
Nathan Fellman
0
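use strict;
use warnings;

my $string = "title: Football team: Real Madrid stadium: Santiago Bernabeu players: Zinédine Zidane, Ronaldo, Luís Figo, Roberto Carlos, Raúl personnel: José Mourinho (head coach) Aitor Karanka (assistant coach (es))";
my @words = split(' ', $string);

my @lines;
my @line = shift(@words);    # the first line starts with the first key, "title:"
foreach my $word (@words)
{
    if ($word =~ /:$/)
    {
        # A word ending in a colon starts a new line: flush the current one
        # and begin the next line with the key itself.
        push(@lines, join(' ', @line));
        @line = ($word);
    }
    else
    {
        push(@line, $word);
    }
}
push(@lines, join(' ', @line));    # flush the last line

print join("\n", @lines), "\n";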
Bwmat
  • This won't work, because Perl doesn't have the concept of an array of arrays. the first `push` will simply concatenate the contents of `@line` onto the end of `@lines`. In order for this to work, `@lines` has to be an array of *references* to arrays, generated by `@line`. – Nathan Fellman Sep 12 '11 at 10:23
  • @lines is an array of strings, I only ever push strings into it – Bwmat Sep 12 '11 at 10:25
  • It is often a good idea to run your code before you post it. This won't run at all. I can see a missing semi-colon for starters. `push` takes an array as its first argument, and you probably meant to concatenate there. But even then it begs the question, why go round the long way? – Zaid Sep 12 '11 at 10:32
  • ahh, I always get confused about the order of the arguments to push. As for why, I'm new to perl and didn't think about lookahead – Bwmat Sep 12 '11 at 10:35
  • @Zaid: I wouldn't fault Bwmat for not thinking of the lookahead. After all, there's more than one way to do it. – Nathan Fellman Sep 12 '11 at 11:49
  • Me neither. @Bwmat I too was in your shoes at one time. Just remember that it is very easy to overlook the simple-yet-powerful features provided by Perl in lieu of C/Java-style coding conventions. – Zaid Sep 12 '11 at 12:02