-1

I have a long continuous text containing characters such as {, }, //, ; and runs of whitespace. I want to read this text and start a new line wherever one of these occurs.

Input text is like :

apple{{mango } guava ; banana; // pear      berry;}

The expected formatted output is:

apple
{
{
mango
}
guava ;
banana;
// pear
berry;
}

I want to do this in Perl. Thanks in advance.

Toto
Sumit

4 Answers

4

Of course you will have to adapt this for your needs (most notably loop while reading lines), but here is a way to do it that doesn't (really) rely on regexes. As others have said, this is a starting point, you may adapt to what you need.

#!/usr/bin/perl
use strict;
use warnings;

my $string = 'apple{{mango } guava ; banana; // pear      berry;}';
my $new_string = join("\n", grep {/\S/} split(/(\W)/, $string));

print $new_string . "\n";

This splits the line into a list on non-word characters; because the pattern is wrapped in capturing parentheses, the separators themselves are kept as elements. grep then keeps only the elements that contain at least one non-whitespace character (dropping the empty and whitespace-only ones). Finally, join glues the remaining elements together with newlines into one string. Your specification appears to need // kept together; I leave that as an exercise to the reader.
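For what it is worth, one possible way to approach that exercise is to list // as an alternative before \W in the split pattern, so the two slashes are captured as a single separator. This is only a sketch; the sub name tokenize is made up for illustration, and note that it still puts every word and every ; on a line of its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Try "//" first so both slashes stay together, then fall back to any
# single non-word character. The capturing group keeps the separators.
sub tokenize {
    my ($text) = @_;
    return grep { /\S/ } split m{(//|\W)}, $text;
}

my $string = 'apple{{mango } guava ; banana; // pear      berry;}';
print join("\n", tokenize($string)), "\n";
```

Alternation order matters here: with \W listed first, each slash would be matched on its own.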

Edit: After looking at your request again, it looks like you have a specific but complicated structure that you are trying to parse. To do it correctly you may have to use something more powerful, like the Regexp::Grammars module. It will take some learning, but you can define a very complicated set of parsing instructions to do exactly what you need.

Edit 2: Since I have been looking for a reason to learn more about Regexp::Grammars, I took this opportunity. This is a basic example that I came up with. It prints the parsed data structure to a file named "log.txt". I know it doesn't look like the structure you asked for, but it contains all of that information and may be reconstituted however you like. I did so with a recursive function that is basically the opposite of the parser.

#!/usr/bin/env perl
use strict;
use warnings;

use Data::Dumper;
use Regexp::Grammars;

my $grammar = qr{
  <nocontext:>
  <Line>
  <rule: Line>      <[Element]>*
  <rule: Element>   <Words> | <Block> | <Command> | <Comment>
  <rule: Command>   <[Words]> ;
  <rule: Block>     \{ <[Element]>* \}
  <rule: Comment>   // .*? \s{2,}        #/ Syntax Highlighter fix
  <rule: Words>     (?:\b\w+\b) ** \s
}x;

my $string = 'apple{{mango kiwi } guava ; banana; // pear      berry;}';

if ($string =~ $grammar) {
  open my $log, ">", "log.txt" or die "Cannot open log.txt: $!";
  print $log Dumper \%/; #/

  print elements($/{Line}{Element});

} else {
  die "Did not match";
}

sub elements {
  my @elements = @{ shift() };
  my $indent = shift || 0;
  my $output = '';   # start empty so an empty block does not return undef

  foreach my $element (@elements) {
    $output .= "\t" x $indent;

    foreach my $key (keys %$element) {
      if ($key eq 'Words') {
        $output .= $element->{$key} . "\n";
      } elsif ($key eq 'Block') {
        $output .= "{\n" . elements($element->{$key}->{Element}, $indent + 1) . ("\t" x $indent) . "}\n";
      } elsif ($key eq 'Comment') {
        $output .= $element->{$key} . "\n";
      } elsif ($key eq 'Command') {
        $output .= join(" ", @{ $element->{$key}->{Words} }) . ";\n";
      } elsif ($key eq 'Element') {
        $output .= elements($element->{$key}, $indent + 1);
      }
    }
  }

  return $output;
}

Edit 3: In light of the comments from the OP, I have adapted the above example to allow multiple words on the same line; for now those words can only be separated by a single space. I also made comments match anything that starts with // and ends in two or more spaces. Since I was making changes anyway, and since I believe this to be a code pretty-printer, I added tab indentation to the block formatter; if that isn't desired, it should be easy enough to strip out. Go now and learn Regexp::Grammars and make it fit your specific case. (I know I should have made the OP do even this change, but I am enjoying learning it as well.)

Edit 4: One more thing: if you are in fact trying to recover useful code that has been serialized onto a single line, your only real problem is extracting the line comments and separating them from the useful code (assuming the language ignores whitespace, which it looks as though it does). If that is the case, then perhaps try this variation on my original code:

#!/usr/bin/perl
use strict;
use warnings;

my $string = 'apple{{mango } guava ; banana; // pear      berry;}';
my $new_string = join("\n", split(/((?:\/\/).*?\s{2,})/, $string));

print $new_string . "\n";

whose output is

apple{{mango } guava ; banana; 
// pear      
berry;}
Joel Berger
  • It works, except that it breaks onto a new line at each word, not only at the special characters. Any suggestions? – Sumit Jun 12 '11 at 16:08
  • @Sumit, I assumed that what you were doing is parsing something more complicated. You will have to adapt the rules to accommodate the structure that you have. Even your example doesn't just split on special characters; the comment mechanism `// words (multiple spaces)` is one line. This framework will allow you to declare the different types of structures that you intend to match logically. If I have over-thought this then I apologize, however, I am going to guess that you are pretty printing some programming language and you will NEED to do better than splitting on special characters. – Joel Berger Jun 13 '11 at 01:43
  • @Sumit, you can add a rule like ` <[Word]>+` and make the Element rule ` | ...`. This will keep multiple words together in a block. You will have to amend the recursive function to handle this as well. Note: this is not tested and as I look at it, might break the Command rule, which you might need to change to ` ;` and possibly some other cleanup. You should be able to adapt it yourself from here. Its not easy I grant you, but it is powerful, flexible and logical. Once you get it right it will serve you well. – Joel Berger Jun 13 '11 at 01:48
  • @Sumit, I have updated the example, because I was interested in the solution, from here on out, use/adapt it to meet your needs. – Joel Berger Jun 13 '11 at 02:44
  • @joel Thanks for correcting me there. Yes, it does not break at //. I'm trying to learn Regexp::Grammars. – Sumit Jun 13 '11 at 13:32
3

Your specification sucks. Sometimes you want a newline before and after. Sometimes you want a newline after. Sometimes you want a newline before. You have "pear" and "berry" on separate lines, but it does not meet any of the conditions in your spec.

The quality of an answer is directly proportional to the care given in composing the question.

With a careless question, you are likely to get a careless answer.

#!/usr/bin/perl
use warnings;
use strict;

$_ = 'apple{{mango } guava ; banana; // pear      berry;}';

s#([{}])#\n$1\n#g; # curlies
s#;#;\n#g;         # semicolons
s#//#\n//#g;       # double slashes
s#\s\s+#\n#g;      # 2 or more whitespace
s#\n\n#\n#g;       # no blank lines

print;
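If the same substitutions need to be applied to more than one string (for example to each line of a large file), they could be wrapped in a sub. This is only a sketch of that idea; the name break_lines is invented here and the five substitutions are unchanged from the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply the substitution chain to a copy of the text and return the result.
sub break_lines {
    my ($text) = @_;
    $text =~ s#([{}])#\n$1\n#g;   # curlies on a line of their own
    $text =~ s#;#;\n#g;           # newline after each semicolon
    $text =~ s#//#\n//#g;         # newline before each double slash
    $text =~ s#\s\s+#\n#g;        # 2 or more whitespace become one newline
    $text =~ s#\n\n#\n#g;         # no blank lines
    return $text;
}

my $string = 'apple{{mango } guava ; banana; // pear      berry;}';
print break_lines($string);
```

For the sample string this prints the ten lines asked for in the question. Order matters: the whitespace cleanup has to run after the steps that insert newlines, because it is what removes the stray spaces and doubled newlines they leave behind.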
tadmc
  • Totally agree with you. His question lacks clarity. – yb007 Jun 11 '11 at 12:45
  • There is a long blank space between pear and berry. I am new to Perl. I guess this sort of question may sound very basic to you. I appreciate your help anyway. – Sumit Jun 11 '11 at 13:37
  • @Sumit: Basic questions are great. But poorly thought out ones aren't. You need to define your specification in a clear and concise way before it can be adapted to code. There's nothing advanced about expressing oneself methodically and clearly. – DavidO Jun 11 '11 at 16:05
  • @DavidO The text is going to be long lines rather than just words like mango or apple; it is like reading one big single line of text, checking character by character for ; { }. When one is found, print the text read so far on a new line and print the special character on a new line, then go on reading the line until it happens again. – Sumit Jun 11 '11 at 16:06
1

Not exactly what you want, but IMHO it is enough for a start:

echo 'apple{{mango } guava ; banana; // pear      berry;}' |\
perl -ple 's/(\b\w+\b)/\n$1\n/g'

will produce:

apple
{{
mango
 } 
guava
 ; 
banana
; // 
pear

berry
;}

You can start improving it...
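On how the one-liner works: -e supplies the program on the command line, -p wraps it in a loop that reads each input line into $_ and prints $_ after the code has run, and -l strips the trailing newline on input and adds one back on output. A rough long-hand equivalent might look like the sketch below. The sub name wrap_words is made up for illustration, and an in-memory filehandle stands in for the piped input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Put every run of word characters on a line of its own;
# this is the substitution the one-liner applies to $_.
sub wrap_words {
    my ($line) = @_;
    $line =~ s/(\b\w+\b)/\n$1\n/g;
    return $line;
}

my $input = "apple{{mango } guava ; banana; // pear      berry;}\n";
open my $in, '<', \$input or die "Cannot open in-memory input: $!";

# Roughly what -p and -l arrange around the -e code.
while (my $line = <$in>) {
    chomp $line;                      # -l: strip the input newline
    print wrap_words($line), "\n";    # -p prints; -l adds the newline back
}
close $in;
```

To read real input instead, the in-memory handle could be replaced with STDIN or a handle opened on a file.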

clt60
  • @jm666 Thanks, this will work. How does -ple 's/(\b\w+\b)/\n$1\n/g' work? – Sumit Jun 11 '11 at 13:55
  • @jm666 The actual file will not contain exactly this text. In the big file I have to read text until characters like ; { } appear, then print the text read up to that point on a new line, then go on reading until it happens the next time, and continue the loop. – Sumit Jun 11 '11 at 14:01
  • @jm666 The text is going to be long lines rather than just words like mango or apple; it is like reading one big single line of text, checking character by character for ; { }. When one is found, print the text read so far on a new line and print the special character on a new line, then go on reading the line until it happens again. – Sumit Jun 11 '11 at 14:06
  • Simply tell us which characters should act as the "line delimiter", i.e. before which characters you want to break the line. – clt60 Jun 11 '11 at 22:49
  • @jm666 Logically, what you said sounds like exactly what I want. Can you suggest something? I tried something like: for $line () { while ($line =~ /(.)/g) { $line =~ s/;/\n/ ; $line =~ s/}/\n/ ; } } It breaks the text onto a new line every time it finds such characters, but it drops the character itself. – Sumit Jun 12 '11 at 15:40
  • @Sumit, you would want to include the character in the substitution: `for $line () { while ($line =~ /(.)/g) { $line =~ s/;/;\n/ ; $line =~ s/}/}\n/ ; } }`, but why are you doing that ugly `while ($line =~ /(.)/g) {` business? – Joel Berger Jun 13 '11 at 02:51
1

As you said this is not homework, something like the following comes to mind:

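use strict;
use warnings;

my $keeps  = qr#(//\s+\w+)#;            # special tokens to keep  (e.g., // perl)
my $breaks = qr#(\s+|\[|\]|\{|\})#;     # simple tokens to split words at

while ( my $text = <> )
{
    # split yields undef for whichever of the two capture groups did not match,
    # so drop undefined elements as well as whitespace-only ones
    my @tokens = grep { defined($_) && /\S/ } split( qr($keeps|$breaks), $text );
    print join(".\n.", @tokens ), "\n";
}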

You will have to work out the actual rules yourself.

Gilbert