Of course you will have to adapt this to your needs (most notably by looping over the input lines), but here is a way to do it that doesn't (really) rely on regexes. As others have said, this is only a starting point.
#!/usr/bin/perl
use strict;
use warnings;
my $string = 'apple{{mango } guava ; banana; // pear berry;}';
my $new_string = join("\n", grep {/\S/} split(/(\W)/, $string));
print $new_string . "\n";
This splits the line into a list on non-word characters; because the pattern is wrapped in capturing parentheses, the separators themselves are kept as elements. The grep then keeps only the elements that contain at least one non-whitespace character, discarding the empty and whitespace-only ones. Finally, join glues the remaining elements into one string separated by newlines. Your specification seems to need the two slashes of // kept together; I leave that as an exercise to the reader.
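One possible way to tackle that exercise, as a minimal sketch rather than a definitive answer: put `//` ahead of `\W` in the split pattern so the two slashes are captured as a single separator. The helper name `tokenize` is hypothetical and only there for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a line into tokens, keeping "//" as one token.
sub tokenize {
    my ($line) = @_;
    # The alternation tries "//" first, so both slashes are captured as a
    # single separator; any other non-word character is captured on its own.
    # grep then drops the empty and whitespace-only pieces.
    return grep { /\S/ } split m{(//|\W)}, $line;
}

my $string = 'apple{{mango } guava ; banana; // pear berry;}';
print join("\n", tokenize($string)), "\n";
```

The order of the alternatives matters here: with `\W` first, each slash would be matched on its own and the comment marker would still be split in two.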
Edit:
After looking at your request again, it looks like you have a specific but complicated structure that you are trying to parse. To do it correctly you may need something more powerful, like the Regexp::Grammars module. It will take some learning, but it lets you define a very complicated set of parsing rules to do exactly what you need.
Edit 2:
Since I have been looking for a reason to learn more about Regexp::Grammars, I took this opportunity. This is a basic example that I came up with. It prints the parsed data structure to a file named "log.txt". I know it doesn't look like the structure you asked for, but it contains all of that information and may be reconstituted however you like. I did so with a recursive function that is basically the opposite of the parser.
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Regexp::Grammars;
my $grammar = qr{
<nocontext:>
<Line>
<rule: Line> <[Element]>*
<rule: Element> <Words> | <Block> | <Command> | <Comment>
<rule: Command> <[Words]> ;
<rule: Block> \{ <[Element]>* \}
<rule: Comment> // .*? \s{2,} #/ Syntax Highlighter fix
<rule: Words> (?:\b\w+\b) ** \s
}x;
my $string = 'apple{{mango kiwi } guava ; banana; // pear  berry;}'; # note the two spaces after "pear": they end the comment
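if ($string =~ $grammar) {
    # %/ is the result hash populated by Regexp::Grammars; dump it to log.txt
    open my $log, '>', 'log.txt' or die "Cannot open log.txt: $!";
    print $log Dumper \%/; #/
    close $log;
    print elements($/{Line}{Element});
} else {
    die "Did not match";
}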
# Recursively turn the parsed structure back into indented text.
sub elements {
    my @elements = @{ shift() };
    my $indent = shift || 0;
    my $output = '';    # start with an empty string so empty blocks return '' rather than undef
    foreach my $element (@elements) {
        $output .= "\t" x $indent;
        foreach my $key (keys %$element) {
            if ($key eq 'Words') {
                $output .= $element->{$key} . "\n";
            } elsif ($key eq 'Block') {
                # Nested elements are printed one tab deeper, between braces
                $output .= "{\n" . elements($element->{$key}->{Element}, $indent + 1) . ("\t" x $indent) . "}\n";
            } elsif ($key eq 'Comment') {
                $output .= $element->{$key} . "\n";
            } elsif ($key eq 'Command') {
                $output .= join(" ", @{ $element->{$key}->{Words} }) . ";\n";
            } elsif ($key eq 'Element') {
                $output .= elements($element->{$key}, $indent + 1);
            }
        }
    }
    return $output;
}
Edit 3: In light of the comments from the OP, I have adapted the above example to allow multiple words on the same line; as of right now those words can only be separated by a single space. I also made comments match anything that starts with // and ends in two or more spaces. Since I was making changes anyway, and since I believe this to be a code pretty-printer, I added tab indentation to the block formatter; if this isn't desired it should be easy enough to strip out. Go now and learn Regexp::Grammars and make it fit your specific case. (I know I should have made the OP do even this change, but I am enjoying learning it as well.)
Edit 4: One more thing: if you are in fact trying to recover useful code that has been serialized onto a single line, your only real problem is extracting the line comments and separating them from the useful code (assuming you are using a whitespace-insensitive language, which it looks as though you are). If that is the case, then perhaps try this variation on my original code:
#!/usr/bin/perl
use strict;
use warnings;
my $string = 'apple{{mango } guava ; banana; // pear  berry;}'; # note the two spaces after "pear": they end the comment
my $new_string = join("\n", split(/((?:\/\/).*?\s{2,})/, $string));
print $new_string . "\n";
whose output is
apple{{mango } guava ; banana;
// pear
berry;}
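As mentioned at the top, real input would need a loop over lines. The following is only a sketch of how that might look. The helper name `split_comments`, the sample text, and the in-memory file handle are all assumptions for illustration; in practice you would read from STDIN or an opened file instead. It also adds one thing the code above does not have: a comment that runs to the end of the line is treated as terminated there.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: break one line into code pieces and "//" comments.
# Assumption: a comment ends at two or more whitespace characters, or at
# the end of the string (the \z alternative is an addition of this sketch).
sub split_comments {
    my ($line) = @_;
    return grep { length } split m{(//.*?(?:\s{2,}|\z))}, $line;
}

# Sample input held in memory; replace $in with STDIN or a real file handle.
my $text = "apple{ banana; // pear  berry;}\nkiwi; // lime\n";
open my $in, '<', \$text or die "Cannot open in-memory handle: $!";

while (my $line = <$in>) {
    chomp $line;
    print "$_\n" for split_comments($line);
}
close $in;
```

Note that the pieces keep their trailing whitespace, just as in the one-line version above; strip it if your target format cares.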