I'm currently working on internationalizing a very large Perl/Mason web application, as a team of one (does that make this a death march??). The application is nearing 20 years old and is written in a relatively old-school Perl style; it doesn't use Moose or any other OO module. I'm planning to use Locale::Maketext::Gettext for message lookups, with GNU gettext catalog (PO) files.
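
The lookup side would look roughly like this, following that module's synopsis; the "myapp" domain, the locale directory, and the MyApp::L10N class name below are placeholders, not anything we actually have yet:

#!/usr/bin/env perl
use strict;
use warnings;

# Minimal sketch of the runtime lookup side, per Locale::Maketext::Gettext's
# documented usage.  Domain name and locale directory are placeholders.
package MyApp::L10N;
use base qw(Locale::Maketext::Gettext);

package main;

my $lh = MyApp::L10N->get_handle()
    or die "No suitable language handle found";
$lh->bindtextdomain("myapp", "/usr/local/share/locale");
$lh->textdomain("myapp");

# Every user-facing literal in the code would end up wrapped like this:
print $lh->maketext("Your order has been shipped."), "\n";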

I've been trying to develop some tools to aid in string extraction from our bigass codebase. Currently, all I have is a relatively simple Perl script that parses the source looking for string literals, shows me each one with some surrounding context, asks whether the string should be marked for translation, and marks it if so.
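
Stripped way down, it amounts to something like this (the real script shows more surrounding lines and rewrites the source when I accept a string; everything here is a simplified sketch):

#!/usr/bin/env perl
use strict;
use warnings;

# Crude first pass: find quoted literals, show the line, ask whether to mark.
my $file = shift or die "usage: $0 <source-file>\n";
open my $fh, '<', $file or die "Can't open $file: $!\n";

my $lineno = 0;
while (my $line = <$fh>) {
    $lineno++;
    while ($line =~ /(['"])((?:\\.|(?!\1).)*)\1/g) {
        my $literal = $2;
        next unless $literal =~ /\S/;    # skip empty/whitespace-only strings

        print "\n$file:$lineno: $line";
        print "Mark \"$literal\" for translation? [y/N] ";
        chomp(my $answer = <STDIN> // '');
        print "  (would wrap it in a maketext call here)\n" if lc($answer) eq 'y';
    }
}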

There's way too much noise, though: far more strings that I can ignore than strings I need to mark. A lot of strings in the source aren't user-facing, such as hash keys, or type comparisons like

if (ref($db_obj) eq 'A::Type::Of::Db::Module')

I do apply some heuristics to each proposed string to see whether I can ignore it off the bat (e.g. I ignore strings that are used for hash lookups, since 99% of the time in our codebase these aren't user-facing). Despite all that, around 90% of the strings my program shows me are ones I don't care about.
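
To make that concrete, here's a toy version of the filtering heuristics; the rules and the test lines are illustrative, not my actual code:

#!/usr/bin/env perl
use strict;
use warnings;

# Given a source line and a literal found on it, guess whether the literal
# could be user-facing.  Purely illustrative heuristics.
sub worth_a_look {
    my ($line, $literal) = @_;

    # Hash subscript keys: $h{'foo'} or $h->{'foo'} -- almost never user-facing.
    return 0 if $line =~ /\{\s*(["'])\Q$literal\E\1\s*\}/;

    # Class-name comparisons: ref($x) eq 'Some::Module'
    return 0 if $line =~ /\bref\s*\([^)]*\)\s*eq\s*(["'])\Q$literal\E\1/;

    # Module-ish or constant-ish strings.
    return 0 if $literal =~ /::/ or $literal =~ /^[A-Z0-9_]+$/;

    # Everything else gets shown to me.
    return 1;
}

while (my $line = <DATA>) {
    while ($line =~ /(['"])((?:\\.|(?!\1).)*)\1/g) {
        my $literal = $2;
        printf "%-4s %s\n", worth_a_look($line, $literal) ? 'ASK' : 'SKIP', $literal;
    }
}

__DATA__
if (ref($db_obj) eq 'A::Type::Of::Db::Module') { }
my $name = $args{'customer_name'};
print "Your order has been shipped.";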

Is there a better way I could help automate my task of string extraction (i.e. something more intelligent than grabbing every string literal from the source)? Are there any commercial programs that do this that could handle both Perl and Mason source?

ALSO, I had a (rather silly) idea for a superior tool, whose workflow I put below. Would it be worth the effort implementing something like this (which would probably take care of 80% of the work very quickly), or should I just submit to an arduous, annoying, manual string extraction process?

  1. Start by extracting EVERY string literal from the source, and putting it into a Gettext PO file.
  2. Then, write a Mason plugin to parse the HTML for each page being served by the application, with the goal of noting strings that the user is seeing.
  3. Use the hell out of the application and try to cover all use cases, building up a store of user facing strings.
  4. Given this store of strings the user saw, do fuzzy matches against strings in the catalog file, and keep track of catalog entries that have a match from the UI (a rough sketch of this matching step follows the list).
  5. At the end, anything in the catalog file that didn't get matched would likely not be user facing, so delete those from the catalog.
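
For step 4, the "fuzzy" matching I have in mind is nothing fancier than normalizing whitespace and treating maketext-style placeholders as wildcards. A toy sketch, with both lists made up inline (in reality they'd come from the Mason hook and the PO file):

#!/usr/bin/env perl
use strict;
use warnings;

# Mark catalog entries that were actually seen in rendered output.
my @seen_on_pages = (
    "Your order has been shipped.",
    "  Welcome back,   Bob!  ",
);

my @catalog_msgids = (
    "Your order has been shipped.",
    "Welcome back, [_1]!",
    "customer_name",                 # noise that should never match
);

# Crude normalization: collapse whitespace and lowercase.
sub normalize {
    my $s = shift;
    $s =~ s/\s+/ /g;
    $s =~ s/^\s+|\s+$//g;
    return lc $s;
}

my @seen_norm = map { normalize($_) } @seen_on_pages;

for my $msgid (@catalog_msgids) {
    # Split around maketext-style placeholders like [_1]; every literal
    # fragment must appear in at least one string the user actually saw.
    my @frags = grep { length } map { normalize($_) } split /\[_\d+\]/, $msgid;
    my $hit   = grep {
        my $page = $_;
        @frags == grep { index($page, $_) >= 0 } @frags;
    } @seen_norm;
    printf "%-8s %s\n", $hit ? 'KEEP' : 'SUSPECT', $msgid;
}
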
hitstuff

2 Answers


There are no Perl tools I know of that will intelligently distinguish strings which might need internationalization from ones that won't. You're supposed to mark them in the code as you write them, but as you said, that wasn't done.

You can use PPI to do the string extraction intelligently.

#!/usr/bin/env perl

use strict;
use warnings;

use Carp;
use PPI;

my $doc = PPI::Document->new(shift)
    or die "Could not parse file: " . PPI::Document->errstr;

# See PPI::Node for docs on find
my $strings = $doc->find(sub {
    my($top, $element) = @_;
    # Debug output: print the class of every element visited.
    print ref $element, "\n";

    # Look for any quoted string or here doc.
    # Does not pick up unquoted hash keys.
    return $element->isa("PPI::Token::Quote")   ||
           $element->isa("PPI::Token::HereDoc");
});

# find() returns false if nothing matches (and undef on error).
$strings = [] unless ref $strings;

# Display the content and location.
for my $string (@$strings) {
    my($line, $char, $col) = @{ $string->location };
    print  "Found string at line $line starting at character $char.\n";
    printf "String content: '%s'\n", string_content($string);
}


# *sigh* PPI::Token::HereDoc doesn't have a string method; its heredoc()
# method returns the content as a list of lines, so join them back up.
sub string_content {
    my $string = shift;
    return $string->isa("PPI::Token::Quote")   ? $string->string :
           $string->isa("PPI::Token::HereDoc") ? $string->heredoc :
           croak "$string is neither a here-doc nor a quote";
}

You can do more sophisticated examination of the tokens surrounding the strings to determine whether a given string is significant. See PPI::Element and PPI::Node for more details. Or you can examine the content of the string itself.
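
For example, here is an untested sketch (building on the $strings list from the script above) that filters out strings appearing only as hash subscript keys; the same parent-walking approach works for other patterns you decide are insignificant:

# Untested sketch: drop strings that are just hash subscript keys,
# e.g. $args{'customer_name'} or $args->{'customer_name'}.
sub is_hash_key {
    my $element = shift;

    # Inside $h{'foo'} the quote token lives in a statement inside a
    # PPI::Structure::Subscript, so walk up a level or two and check.
    for (my $up = $element->parent; $up; $up = $up->parent) {
        return 1 if $up->isa("PPI::Structure::Subscript");
        last unless $up->isa("PPI::Statement::Expression");
    }
    return 0;
}

my @candidates = grep { !is_hash_key($_) } @$strings;
printf "%d of %d strings survive the hash-key filter.\n",
    scalar @candidates, scalar @$strings;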

I can't go much further because "significant" is up to you.

Schwern
  • Thanks, Schwern. This is definitely a more robust way to identify strings in Perl source. I actually set out using PPI, but around 2/3 of the code I'm internationalizing is in Mason, and PPI was having some difficulties trying to build a DOM out of that source. I ended up just doing it by hand with a regex to identify quote-delimited tokens. More importantly, I think your answer made me realize that my question is kinda silly, in that it's improbable there's any good way to automagically determine which strings I would deem worth marking. Seems the only way is the old-fashioned way. – hitstuff Oct 04 '11 at 21:20
  • @hitstuff For a project like this, which just needs a human with light technical knowledge to slog through it, perhaps you can convince your company to assign you an intern or hire a CS student to do the work. Provide them with the generated list of strings, possibly with a few lines of context before and after, and instructions on what to look for. Have them mark which are significant and which are not. – Schwern Oct 06 '11 at 07:14
  • I'd guess that sorting the strings would make it significantly easier to classify big chunks of them. – Ira Baxter Oct 08 '11 at 17:03

Our Source Code Search Engine is normally used to efficiently search large code bases, using indexes constructed from the lexemes of the languages it knows. That list of languages is pretty broad, including Java, C#, COBOL and ... Perl. The lexeme extractors are language-precise (because they are "stolen" from our DMS Software Reengineering Toolkit, a language-agnostic program transformation system, where precision is fundamental).

Given an indexed code base, one can then enter queries to find arbitrary sequences of lexemes in spite of language-specific white space; one can log the hits of such queries and their locations.

The extremely short query:

S

to the Search Engine finds all lexical elements that are classified as strings (keywords, variable names, and comments are all ignored; just strings!). (Normally people write more complex queries with regular expression constraints, such as S=*Hello to find strings that end with "Hello".)

The relevance here is that the Source Code Search Engine has precise knowledge of lexical syntax of strings in Perl (including specifically elements of interpolated strings and all the wacky escape sequences). So the query above will find all strings in Perl; with logging on, you get all the strings and their locations logged.

This stunt actually works for any language the Search Engine understands, so it is a rather general way to extract strings for such internationalization tasks.

Ira Baxter