I'm currently working on internationalizing a very large Perl/Mason web application, as a team of one (does that make this a death march??). The application is nearing 20 years old, and is written in a relatively old-school Perl style; it doesn't use Moose or another OO module. I'm currently planning on using Locale::Maketext::Gettext to do message lookups, and using GNU Gettext catalog files.
I've been trying to develop some tools to aid in string extraction from our bigass codebase. Currently, all I have is a relatively simple Perl script to parse through source looking for string literals, prompt the user with some context and whether or not the string should be marked for translation, and mark it if so.
There's way too much noise in terms of strings I need to mark versus strings I can ignore. A lot of strings in the source aren't user-facing, such as hash keys, or type comparisons like
if (ref($db_obj) eq 'A::Type::Of::Db::Module')
I do apply some heuristics to each proposed string to see whether I can ignore it off the bat (ex. I ignore strings that are used for hash lookups, since 99% of the time in our codebase these aren't user facing). However, despite all that, around 90% of the strings my program shows me are ones I don't care about.
Is there a better way I could help automate my task of string extraction (i.e. something more intelligent than grabbing every string literal from the source)? Are there any commercial programs that do this that could handle both Perl and Mason source?
ALSO, I had a (rather silly) idea for a superior tool, whose workflow I put below. Would it be worth the effort implementing something like this (which would probably take care of 80% of the work very quickly), or should I just submit to an arduous, annoying, manual string extraction process?
- Start by extracting EVERY string literal from the source, and putting it into a Gettext PO file.
- Then, write a Mason plugin to parse the HTML for each page being served by the application, with the goal of noting strings that the user is seeing.
- Use the hell out of the application and try to cover all use cases, building up a store of user facing strings.
- Given this store of strings the user saw, do fuzzy matches against strings in the catalog file, and keep track of catalog entries that have a match from the UI.
- At the end, anything in the catalog file that didn't get matched would likely not be user facing, so delete those from the catalog.