I'm kind of stumped on this one. I'm trying to parse a file that has data like so:

"1111 Some random descriptive text can have numbers and letters",
// :property1.Some description
// :property2.A different description
// :property3.Yet another
"2222 More random text here",
// :property1.Some description
// :property1.A different description
// :property2.Yet another description
// :property3.Yet another

I'm going to parse this and create html files.

I currently have it in an array after doing:

@array = <FILE>;

#Put it in a single long string:
$long_string = join("",@array);

#Then trying to split it with the following regex:
@split_array = split(/\"\d{4}.+",/,$long_string);

I'm planning to save the matched string and correlate it with the property fields that follow it, but I'm not sure how.

I'm really doubting my approach now.

Toto
perlfoo
  • Why do you find it challenging? And what kind of HTML do you have to produce? Post your code, with an example of the desired output; otherwise people cannot help you. To get you started, you can see that the lines follow some regular patterns: (1) lines starting with `"` and ending with `",`, and (2) lines starting with `// :property` – MarcoS May 27 '11 at 19:28

1 Answer

When parsing text, you need to identify the critical leverage points that help you distinguish one piece of information from another. Here's what I see in your text:

  • Each line is a distinct unit.

  • Some lines begin with // and others don't.

  • There is some regularity at the beginnings of lines, but a lot of variability in the rest.

By slurping-and-joining the document into a single string, you are weakening those points of leverage.
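To make those leverage points concrete, here is a minimal sketch that looks at one line at a time and decides what kind of line it is. The helper name `classify_line` and both patterns are assumptions of mine, guessed from the sample data in the question; a real file may need looser or stricter patterns.

```perl
use strict;
use warnings;

# Hypothetical helper: report what kind of line we are looking at.
# Both patterns are guesses based on the sample in the question.
sub classify_line {
    my ($line) = @_;
    return 'header'   if $line =~ /^"\d+\s/;        # e.g. "1111 Some text",
    return 'property' if $line =~ m{^//\s*:\w+\.};  # e.g. // :property1.Text
    return 'other';                                 # blank lines and anything else
}

my @sample = (
    qq{"1111 Some random descriptive text",\n},
    qq{// :property1.Some description\n},
    qq{\n},
);

print classify_line($_), "\n" for @sample;
```

Because each pattern is anchored at the start of a line, neither one has to cope with the variable text in the rest of the line, which is exactly the advantage that is lost once everything is joined into one string.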

Another key parsing strategy is to break things down into simple, easily understood steps. Here, too, the run-one-regex-against-a-giant-string strategy is often the wrong direction.

This is how I would start:

use strict;
use warnings;

open(my $file_handle, '<', 'input_file_name') or die $!;

while (my $line = <$file_handle>){
    if ( $line =~ /^\"(\d+)/ ){
        my $number = $1;
        ...
    }
    else {
        ...
    }
}
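One possible way to fill in the two branches, offered only as a sketch: start a new record on each header line and attach every property line to the most recent record. The name `parse_records`, the hash keys, and the exact patterns are my own assumptions based on the sample data, and the lines are passed in as a list here so the example runs without an input file.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of lines into a list of records, each
# holding the number, the descriptive text, and its property lines.
sub parse_records {
    my (@lines) = @_;
    my @records;
    for my $line (@lines) {
        chomp $line;
        if ( $line =~ /^"(\d+)\s+(.*)",\s*$/ ) {
            # A header line starts a new record.
            push @records, { number => $1, text => $2, properties => [] };
        }
        elsif ( $line =~ m{^//\s*:(\w+)\.(.*)$} ) {
            # A property line belongs to the most recent header, if any.
            next unless @records;
            push @{ $records[-1]{properties} },
                { name => $1, description => $2 };
        }
        # Anything else (blank lines, etc.) is ignored in this sketch.
    }
    return @records;
}

my @records = parse_records(
    qq{"2222 More random text here",\n},
    qq{// :property1.Some description\n},
    qq{// :property1.A different description\n},
    qq{// :property2.Yet another description\n},
);

for my $record (@records) {
    print "$record->{number}: $record->{text}\n";
    print "  $_->{name} = $_->{description}\n" for @{ $record->{properties} };
}
```

The properties are kept in an array rather than a hash keyed by name because the sample data repeats `property1` within a single record, and a hash would silently keep only the last one. Generating the HTML then becomes a separate step that walks `@records`.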
FMc