6

I am currently trying to extract acronyms from a bunch of documents.

Say a document contains "Static application security testing (SAST)"

So I am trying to create a regex for picking out these kinds of strings. It should probably be something like

"a number of words whose initial letter is later repeated in the braces."

Unfortunately, my regex skills are not good enough to formulate this. Do you folks think it can be done via regex at all, or do I need something more powerful like a CFG-based parser?

er4z0r
  • What language? This can be a fun little regex in .Net, but I'm not sure Java can handle it. The general answer is that it isn't possible using a regex, but **extremely easy** to do manually by looping through words, you don't really need a parser. – Kobi Jan 04 '11 at 12:10
  • Even if regex can do this, I'm not quite sure it *belongs* in the best-done-via-regex domain. See [To use or not to use regular expressions?](http://stackoverflow.com/questions/4098086/to-use-or-not-to-use-regular-expressions/4098123#4098123). Finding a number of words followed by an all-caps, no-space letter sequence in parens is easy and best done via regex, though. –  Jan 04 '11 at 12:13
  • Yikes, the first time I read through this my brain misregistered *anagram* for *acronym*! I don’t know that regexes are all that inappropriate for *acronyms* — the proffered solution seems pretty straightforward — but using one to generate *anagrams* would be tantamount to implementing polyphonic counterpoint on an inherently single‐threaded instrument like the violin. You’d have to either be mad, or a true master, to even attempt it (*viz.* BWV 1001–1006). – tchrist Jan 04 '11 at 17:29
  • I've solved it with [.Net groups for every length](http://kobikobi.wordpress.com/2011/01/04/net-regular-expressions-finding-acronyms-and-reversing-the-stack/), if anyone is interested. Just an exercise. – Kobi Jan 04 '11 at 21:20
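
The manual approach mentioned in the comments (looping through the words rather than writing one big regex) might look roughly like the following. This is only a sketch in Python, not anything posted in the thread: the function name `find_acronyms` is hypothetical, and it assumes acronyms are two or more uppercase ASCII letters in parentheses and that words consist of ASCII letters only.

```python
import re

def find_acronyms(text):
    """Return (expansion, acronym) pairs, e.g.
    ("Static application security testing", "SAST")."""
    results = []
    # Assumption: an acronym is 2+ uppercase ASCII letters in parentheses.
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        # All words that precede the opening parenthesis.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        # Take as many trailing words as the acronym has letters.
        candidate = words[-len(acronym):]
        if len(candidate) != len(acronym):
            continue  # not enough words before the parenthesis
        # Compare each word's initial with the matching acronym letter.
        if all(w[0].upper() == c for w, c in zip(candidate, acronym)):
            results.append((" ".join(candidate), acronym))
    return results
```

A small regex is still used to find the parenthesised candidate, but the initial-letter check itself is a plain loop, so it works for acronyms of any length.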

2 Answers

4

Try this (for 2-letter acronyms):

\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\)

This for 3-letter acronyms:

\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\3\)

This for 4-letter acronyms:

\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\3\4\)

Please note that the regex needs to be case-insensitive, because the captured initials may be lowercase while the acronym in parentheses is uppercase.
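
Since the three patterns above differ only in how often the word part repeats, they could be generated instead of written by hand. The sketch below does this in Python; `acronym_pattern` is a hypothetical helper name, and Python's `re` module is an assumption on my part, as the answer does not name a language.

```python
import re

def acronym_pattern(n):
    """Build the pattern above for an n-letter acronym (small n >= 1)."""
    # One capturing group per word: initial letter, rest of word, whitespace.
    words = r"\b(\w)\w+\s+" * n
    # Backreferences \1 ... \n must appear, in order, inside the parentheses.
    backrefs = "".join(rf"\{i}" for i in range(1, n + 1))
    # IGNORECASE lets a lowercase initial match its uppercase acronym letter.
    return re.compile(words + r"\(" + backrefs + r"\)", re.IGNORECASE)

match = acronym_pattern(4).search("Static application security testing (SAST)")
print(match.group(0) if match else "no match")
```

You still need to try one pattern per acronym length, so this only saves typing; it does not remove the fixed-length limitation.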

BTW the Regex Coach is a nice tool for trying out stuff like this.

Helge Klein
  • I'll check it out. I have the tools from regular-expressions.info here. I just have not spent many brain cells on the whole issue yet. – er4z0r Jan 04 '11 at 12:47
1

Here are two Perl solutions. The first one goes word by word, building an array from the first letter of every word, then removes the acronym formed by those letters. It's fairly weak, and should fail if a line contains anything other than the words and their acronym. It also makes use of the (??{}) pattern to insert the acronym into the regex, which makes me queasy:

use strict;
use warnings;
use 5.010;

$_ = "Static application security testing (SAST)";

my @first;
s/
   \b
    (?<first>\p{L})\p{L}*
   \b
(?{ push @first, $+{first} })
  \K \s+ \(
    (??{ join '', map { uc } @first; })
    \)
//gx;

say;

Meanwhile, this solution first checks for something that looks like an acronym, then constructs a regex to match as many words as necessary:

$_ = "Static application security testing (SAST)";

my ($possible_acronym) = /\((\p{Lu}+)\)/;
my $regex = join '', map({ qr/\b(?i:$_)\p{L}*\b\s*?/ } split //, $possible_acronym), qr/\K\Q($possible_acronym)/;
s/$regex//;

say;
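
For readers without Perl at hand, the same acronym-first idea can be sketched in Python. This is a loose adaptation under stated assumptions, not a translation of the code above: `strip_acronym` is a hypothetical name, words are limited to ASCII letters instead of \p{L}, and because Python's `re` has no \K, the matched words are captured and substituted back (which also removes the space before the parenthesis).

```python
import re

def strip_acronym(text):
    """Remove "(ACRONYM)" when the words just before it spell it out."""
    # Step 1: look for something that resembles an acronym.
    found = re.search(r"\(([A-Z]+)\)", text)
    if found is None:
        return text
    acronym = found.group(1)
    # Step 2: one word per acronym letter; only the initial ignores case.
    words = r"\s+".join(
        rf"\b(?i:{re.escape(c)})[A-Za-z]*\b" for c in acronym
    )
    # Step 3: keep the words (group 1) and drop the parenthesised acronym.
    pattern = re.compile(rf"({words})\s*\({re.escape(acronym)}\)")
    return pattern.sub(r"\1", text, count=1)

print(strip_acronym("Static application security testing (SAST)"))
```

Building the pattern from the acronym means its length no longer has to be known in advance, which is the main advantage over the fixed-length patterns.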

(I tried making a solution using (?(DEFINE)) patterns, such as tchrist's answer here, but failed miserably. Oh well.)

For more about (?:), named captures (?<name>...), \K, and a whole bunch of swell stuff, perlre is the answer.

Community
Hugmeir