0

I have the following strings in a file and want to truncate each one to no more than 6 characters. How can I do that with a regular expression in Perl?
The original file is:

cat shortstring.in:

<value>1234@google.com</value>
<value>1235@google.com</value>

I want the output file to be:
cat shortstring.out

<value>1234@g</value>
<value>1235@g</value>

I have the code below. Is there a more efficient way than using
s/<value>(\w\w\w\w\w\w)(.*)/$1/;?

Here is a part of my code:

    while (<$input_handle>) {                        # take one input line at a time
            chomp;
            if (/(\d+@google.com)/) {
                    s/(<value>\w\w\w\w\w\w)(.*)</value>/$1/;
                    print $output_handle "$_\n";
              } else {
              print $output_handle "$_\n";
            }
    }
Greg Bacon
user399517
  • @ is not a word character, so it isn't matched by \w. Also, I think you don't mean to remove the `</value>` part? – ysth Aug 03 '10 at 19:28

5 Answers

10

Use substr instead; regex is not the only feature of Perl, and it's overkill for this. :-)

$str = substr($str, 0, 6);

http://perldoc.perl.org/functions/substr.html
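
As a minimal sketch of how this behaves, assuming the value has already been pulled out of the surrounding tags (the helper name truncate_to is hypothetical, not from the question): substr returns at most the requested number of characters and simply returns the whole string when it is already shorter.

```perl
use strict;
use warnings;

# Hypothetical helper: keep at most $max characters of $str.
sub truncate_to {
    my ($str, $max) = @_;
    return substr($str, 0, $max);
}

print truncate_to('1234@google.com', 6), "\n";   # long value is cut to 6 characters
print truncate_to('12@g', 6), "\n";              # short value is returned unchanged
```

Note that calling this on a whole input line would count the characters of the opening tag as well, so the value has to be isolated first.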

Alan Moore
Hut8
5
$ perl -pe 's/(<value>[^<]{1,6})[^<]*/$1/' shortstring.in
<value>1234@g</value>
<value>1235@g</value>

In the context of the snippet from your question, use

while (<$input_handle>) {
  s!(<value>)(.*?)(</value>)!$1 . substr($2,0,6) . $3!e
    if /(\d+\@google\.com)/;
  print $output_handle $_;
}

or to do it with a single pattern

while (<$input_handle>) {
  s!(<value>)(\d+\@google\.com)(</value>)!$1 . substr($2,0,6) . $3!e;
  print $output_handle $_;
}

Using bangs as the delimiters on the substitution operator prevents Leaning Toothpick Syndrome in </value>.
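
To illustrate the delimiter point, here is a small sketch that runs the same substitution twice on a sample line (the variable names are made up for this example). The only difference is the delimiter, and therefore whether the slash in the closing tag has to be escaped.

```perl
use strict;
use warnings;

my $with_slashes = "<value>1234\@google.com</value>";
my $with_bangs   = $with_slashes;

# With / as the delimiter, the slash in </value> must be escaped.
$with_slashes =~ s/(<value>)(.*?)(<\/value>)/$1 . substr($2, 0, 6) . $3/e;

# With ! as the delimiter, the same pattern needs no escaping.
$with_bangs =~ s!(<value>)(.*?)(</value>)!$1 . substr($2, 0, 6) . $3!e;

print "$with_slashes\n";
print "$with_bangs\n";
```

Both substitutions should leave the same truncated line; the bang form is just easier to read.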

NOTE: The usual warnings about “parsing” XML with regular expressions apply.

Demo program:

#! /usr/bin/perl

use warnings;
use strict;

my $input_handle = \*DATA;
open my $output_handle, ">&=", \*STDOUT or die "$0: open: $!";

while (<$input_handle>) {
  s!(<value>)(\d+\@google\.com)(</value>)!$1 . substr($2,0,6) . $3!e;
  print $output_handle $_;
}

__DATA__
<value>1234@google.com</value>
<value>1235@google.com</value>
<value>12@google.com</value>

Output:

$ ./prog.pl 
<value>1234@g</value>
<value>1235@g</value>
<value>12@goo</value>
Community
Greg Bacon
  • I think my code is not correct. I only want to truncate the data between `<value>` and `</value>`. – user399517 Aug 03 '10 at 19:30
  • Yours does not work. In the end I used this: s/(.{1,$truncate_num}).*(<.*)/$1$2/; – user399517 Aug 04 '10 at 00:07
  • @gbacon thanks for the updated s!!!e syntax. Someone else had posted then deleted that, but it didn't include the "<value>" tags. Had never used s!!!e before and was curious how it would have looked if done correctly. – David Blevins Aug 04 '10 at 00:10
  • @David Perl is flexible about the delimiters on the `s///` operator. Using bangs meant I didn't have to escape the slash in `</value>`. – Greg Bacon Aug 04 '10 at 01:47
  • @lilili08 See updated answer for a working program that includes the code I suggested. What's the output you're seeing? – Greg Bacon Aug 04 '10 at 01:59
  • @gbacon The various separator parts I knew; the 'e' flag is the gem I've wanted several times and did not know how to do. Loving stackoverflow. – David Blevins Aug 04 '10 at 02:36
1

Looks like you want to truncate the text inside the tag, which could already be shorter than 6 characters, in which case:

s/(<value>[^<]{1,6})[^<]*/$1/
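
A short sketch of how that pattern treats values of different lengths; the helper name shorten_value and the second sample line are assumptions made for illustration. The `{1,6}` keeps up to six non-`<` characters, and `[^<]*` discards whatever remains before the closing tag.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution above.
sub shorten_value {
    my ($line) = @_;
    $line =~ s/(<value>[^<]{1,6})[^<]*/$1/;
    return $line;
}

# One value longer than 6 characters, one already shorter.
for my $line ('<value>1234@google.com</value>', '<value>12</value>') {
    print shorten_value($line), "\n";
}
```

Because the character class stops at `<`, the closing tag is never consumed, so short values pass through untouched.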
Alan Moore
David Blevins
1

Try this:

s|(?<=<value>)(.*?)(?=</value>)|substr $1,0,6|e;
Eugene Yarmash
0
s/<value>(.{1,6}).*/<value>$1<\/value>/;
Paul Tomblin
  • With the . in (.{1,6}) you could get stuff like '123 – David Blevins Aug 03 '10 at 19:49
  • @David, no, because he's already tested to make sure the tag has `@google.com`, so it can't be smaller than that. If you want to be more careful, you could test for the closing tag, but since parsing xml or html in a regex is a REALLY REALLY BAD IDEA anyway, I don't want to give him any ideas. – Paul Tomblin Aug 03 '10 at 20:37