0

I've written a translator in Perl for a message board migration. All it does is apply regexes and print the result. I redirect stdout to a file and off we go! The problem is that my program stops working after about 18 MB has been written.

I wrote translate.pl ( https://gist.github.com/914450 ) and launch it like this: `perl translate.pl mydump.sql > mydump-bbcode.sql`

Really sorry for the quality of the code, but I never use Perl... I tried sed for the same job but didn't manage to apply the regexes I found in the original script.

[EDIT] I reworked the code and sanitized some regexes (see gist.github.com/914450), but I'm still stuck. When I split the big dump into 15 MB files, I launched translate.pl seven processes at a time to use all cores, but each run stops at a different output size. Running `tail` on the output doesn't show any especially complex message or URL at the point where it stops...

Thanks, guys! I'll let you know if I finally manage it.

Dextair
  • What happens when you remove everything but `print` from the loop? Do you get identical files? – w.k Apr 12 '11 at 08:10
  • Since you are parsing one line at a time from your sql dump file, many of your regexps will not match if a tag happens to span multiple lines (which is perfectly valid HTML). It really depends on how your sql dump file is formatted. If there is one INSERT statement per line (with escaped linebreaks within your HTML content), then you should be okay to proceed with your strategy. – Sam Choukri Apr 12 '11 at 09:00

5 Answers

1

yikes - start with the basics:

use strict;
use warnings;

..at the top of your script. It will complain about not properly declaring your lexicals, so go ahead and do that. I don't see anything obvious that would be truncating your file, but perhaps one or more of your regexes is pathological. Also, the undefs at the end are not needed.
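
As a rough illustration of that advice, here is a minimal sketch of a line-by-line translator with `use strict`, `use warnings`, lexical variables, and the three-argument form of `open`. The `translate_line` and `translate_stream` names and the two regexes are hypothetical stand-ins, not the contents of the real translate.pl:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real regexes in translate.pl.
sub translate_line {
    my ($line) = @_;
    $line =~ s{<b>(.*?)</b>}{[b]$1\[/b]}gi;   # <b>..</b> becomes [b]..[/b]
    $line =~ s{<i>(.*?)</i>}{[i]$1\[/i]}gi;   # <i>..</i> becomes [i]..[/i]
    return $line;
}

# Read every line from $in, translate it, and write it to $out.
sub translate_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} translate_line($line);
    }
    return;
}

# Use the file named on the command line, or a tiny in-memory sample.
my $path = shift @ARGV;
my $in;
if (defined $path) {
    open($in, '<', $path) or die "Cannot open '$path': $!";
} else {
    my $sample = "INSERT INTO posts VALUES ('<b>Hi</b> <i>there</i>');\n";
    open($in, '<', \$sample) or die "Cannot open in-memory sample: $!";
}
translate_stream($in, \*STDOUT);
close($in) or die "Cannot close input: $!";
```

Run without arguments, it translates the built-in sample line; run as `perl sketch.pl mydump.sql > mydump-bbcode.sql` (the script name is made up), it processes the dump.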

For what you are doing, you might consider just using sed.

Mike Ellery
  • Hello and thank you. I added those two lines and declared $html and $file with my(), but it makes no difference. I wasn't able to translate the regexes for sed (I tried and lost almost an hour). – Dextair Apr 11 '11 at 23:05
1

You say the "script stops". It keeps running but produces no more output? Or actually stops running? If it stops running, what does:

perl translate.pl mydump.sql > mydump-bbcode.sql
echo $?

show? And if you add a print STDERR "done!\n"; after your loop, does that show up?
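
One way to tell "stuck on a line" apart from "finished early" is to report progress on STDERR while the loop runs. The sketch below assumes a plain copy loop (the regexes would go before the `print` to the output handle); `copy_with_progress` and its parameters are made-up names:

```perl
use strict;
use warnings;

# Copy $in to $out, reporting progress on $err every $every lines.
# The per-line translation is left out; a real loop would apply the
# regexes before printing the line.
sub copy_with_progress {
    my ($in, $out, $err, $every) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        print {$out} $line;
        $count++;
        print {$err} "processed $count lines\n" if $count % $every == 0;
    }
    # Only reached when the loop ends normally at end of file.
    print {$err} "done! $count lines total\n";
    return $count;
}

# Demo on a five-line in-memory sample.
my $sample = join '', map { "row $_\n" } 1 .. 5;
open(my $in, '<', \$sample) or die "Cannot open in-memory sample: $!";
copy_with_progress($in, \*STDOUT, \*STDERR, 2);
```

If the progress messages stop and "done!" never appears, the script is still busy on one line rather than exiting.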

ysth
  • In fact, the script doesn't stop cleanly; I just see that the file size (`$ watch ls -lah myfile`) stops increasing after some time! The time varies from file to file (I have 25 dumps of 15 MB each). – Dextair Apr 13 '11 at 21:47
  • @Dextair: so the script keeps going endlessly? how long have you waited? maybe try installing the pv utility and running `pv mydump.sql | perl translate.pl - >mydump-bbcode.sql` and see what it shows? – ysth Apr 13 '11 at 21:59
0

Perl can certainly handle files much larger than 18 MB. I know because I routinely run files of 5 GB through Perl.

I think that your problem is in while($html=<FILE>).

Whenever $html is set to an empty line the while will evaluate as False and exit the loop.

You need to use something like while( defined( $html = <FILE> ) )

Edit:

Hmm. I had always thought you need the defined but in my testing just now it didn't exit on blank lines or 0. Must be more of that special Perl magic that mostly works the way you intend -- except when it doesn't.

Indeed if you restructure the while loop enough you can fool Perl into working the way I always thought it worked. (And it might have, in Perl 4 or in earlier versions of Perl 5)

This will fail:

$x = <>;
chomp $x;
while( $x ) {
    print $x;
    $x = <>;
    chomp $x;
}
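
For comparison, here is a small sketch that counts lines both ways on the same input: once with a truth test on the chomped value, as in the loop above, and once with an explicit `defined` test. The sub names and the sample data are invented for the illustration:

```perl
use strict;
use warnings;

# Truth test on the chomped value: stops at the first blank line or "0".
sub count_with_truth_test {
    my ($in) = @_;
    my $count = 0;
    my $x = <$in>;
    chomp $x if defined $x;
    while ($x) {
        $count++;
        $x = <$in>;
        chomp $x if defined $x;
    }
    return $count;
}

# Definedness test: stops only at end of file.
sub count_with_defined_test {
    my ($in) = @_;
    my $count = 0;
    while (defined(my $x = <$in>)) {
        chomp $x;
        $count++;
    }
    return $count;
}

# Four lines: text, a blank line, a lone 0, text.
my $sample = "first\n\n0\nlast\n";
for my $counter (\&count_with_truth_test, \&count_with_defined_test) {
    open(my $in, '<', \$sample) or die "Cannot open in-memory sample: $!";
    print $counter->($in), "\n";
}
```

On this sample the first loop reports 1 line and the second reports 4, because the chomped blank line is an empty string, which is false.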
Zan Lynx
  • [Not true.](http://stackoverflow.com/questions/3773917/whats-the-most-defensive-way-to-loop-through-lines-in-a-file-with-perl) – CanSpice Apr 11 '11 at 22:16
  • A blank line is "\n" which is true. Paranoia about adding the defined() (which is actually added for you by perl when it encounters the `while ( VAR=READLINE )` pattern) is only needed for the case of a file with a trailing line `0` with no newline. – ysth Apr 11 '11 at 23:18
  • @ysth: No, that’s not so. If you deparse the while loop, you’ll see that the compiler is Your Friend. – tchrist Apr 12 '11 at 00:40
  • @tchrist: in the simple case, yes, but that's what I already said. ?? – ysth Apr 12 '11 at 01:58
  • @ysth: the "`0` with no newline" case is handled properly by Perl. See the link I posted. – CanSpice Apr 12 '11 at 16:04
  • @CanSpice: I said "which is actually added for you by perl...". But it is not added all the time, for instance `while ( $not_found and my $line = <> )` – ysth Apr 12 '11 at 16:26
  • @ysth: Yeah, you were right the first time. I blame lack of coffee. :-) – CanSpice Apr 12 '11 at 16:29
0

There could be any number of things going on:

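  1. Output is being buffered. Try adding $| = 1; to the top of your script. This turns on autoflush for STDOUT (the currently selected output handle), so each print is written out immediately.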
  2. One of your regexes is going crazy and is deleting strings when you're not expecting it.
  3. You've run out of disk space.

There's nothing really wrong with your script (other than you're missing use strict; use warnings; and you're not using the three-argument form of open()) that would cause it to stop working after some magic number of bytes.
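
If a runaway regex (point 2) is the suspect, one way to look for it is to time the work done on each line and report the slow ones. This is only a sketch: `find_slow_lines` is a made-up helper, and the demo callback simulates a slow match with a short sleep instead of using the real regexes:

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Run $translate on every line of $in and return [line number, seconds]
# pairs for the lines that took longer than $limit seconds.
sub find_slow_lines {
    my ($in, $translate, $limit) = @_;
    my @slow;
    my $line_no = 0;
    while (my $line = <$in>) {
        $line_no++;
        my $start = time;
        $translate->($line);
        my $elapsed = time - $start;
        push @slow, [$line_no, $elapsed] if $elapsed > $limit;
    }
    return @slow;
}

# Hypothetical stand-in for the regexes: sleeps on one line to
# simulate a match that takes far too long.
my $fake_translate = sub {
    my ($line) = @_;
    sleep(0.05) if $line =~ /^slow/;
    return $line;
};

my $sample = "quick line\nslow line\nquick again\n";
open(my $in, '<', \$sample) or die "Cannot open in-memory sample: $!";
for my $hit (find_slow_lines($in, $fake_translate, 0.02)) {
    printf STDERR "line %d took %.3f s\n", @$hit;
}
```

A match that never returns at all will not be reported this way, since the timer is only read after the callback finishes; in that case, printing the line number to STDERR before the regexes run shows where it is stuck.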

CanSpice
  • Thank you. I added $| = 1 and nothing changed. For the record, I split the big dump into 25 pieces of 15 MB and wrote a bash script to run translate.pl 8 times in parallel. Each process gets stuck at a different size, so I don't think it's a buffering problem after all. – Dextair Apr 11 '11 at 23:08
0

Hello guys, and thank you so much for your help and ideas! After trying to split the input and parallelize the jobs, I tried splitting my program into 3 programs: translate1.pl, translate2.pl and translate3.pl. The job is done, and it's fast with 8 active cores!

My launcher.sh then runs the 3 scripts in succession for each split file. It's done with 2 loops, and here we go :)

Regards, Yoann

Dextair