Creating a regular expression for a list of strings

Question

I have extracted a series of tables from the scientific literature which consist of columns each of which is a distinct type. Here is an example table of values

I'd like to be able to automatically generate regular expressions for each column. Obviously there are trivial solutions such as .* so I would add the constraints that they use only:

[A-Z] [a-z] [0-9]
explicit punctuation (e.g. ',',''')
"simple" quantifiers (e.g. {3,4})

A "best" answer for the table above would be:

 [A-Z]{3}
 [A-Za-z\s\.]+
 \d{4}\sm
 \d{2}\u00b0\d{2}'\d{2}"N,\d{2}\u00b0\d{2}'\d{2}"E
 (speciosissima|intermediate|troglodytes)
 (hf|sr)
 \d{4}

Of course the 4th regex would break if we moved outside the geographical area, but the software doesn't know that. The aim would be to collect many regexes for, say, "Coordinates" and generalize them, probably partly by hand. The enums would only be created if there were a small number of distinct strings.

I'd be grateful for examples of (especially F/OSS) software that can do this, especially in Java. (It's similar to Google's Refine.) I am aware of this question from 4 years ago, but that didn't really answer the question, and of the text2re site, which appears to be interactive.

NOTE: I note a vote to close as "too localised". This is a very general problem (the table given is only an example) as shown by Google/Freebase developing Refine to tackle the problem. It potentially refers to a very wide variety of tables (e.g. financial, journalism, etc.). Here's one with floating point values: [image]

It would be useful to determine automatically that some authorities report ages as real numbers (i.e. not in months or days) and use 2 digits of precision.

Another "close" vote as "off topic". Given that the answer so far relates precisely to a programming technique, it seems clearly in scope. — peter.murray.rust, May 11 '13 at 22:39
@mark: my understanding is that this question is more about finding a model for each table column rather than necessarily using any particular regular expression package or, indeed, regular expressions at all. — Tikhon Jelvis, May 12 '13 at 01:18
This question is not "too localized" at all! If anything, it's quite the opposite and touches on a whole *research area*. This is especially odd considering that questions about how to write a *particular* regular expression are always welcome in this tag. — Tikhon Jelvis, May 12 '13 at 01:34
I have accepted @Tikhon's answer as it is an excellent overview of the formal position and potential approach. I shall myself try a simple heuristic approach and give my own answer - comments welcome. — peter.murray.rust, May 13 '13 at 09:48

Tikhon Jelvis · Accepted Answer · 2013-05-11T22:35:57.023

Your particular issue is a special case of "programming by demonstration". That is, given a bunch of input/output examples, you want to generate a program. For you, the inputs are strings and the output is whether each string belongs to the given column. In the end, you want to generate a program in the language of limited regular expressions that you proposed.

This particular instance of programming by demonstration seems closely related to Flash Fill, a recent project from MSR. There, instead of generating regular expressions to match data, they automatically generated programs to transform string data based on input/output examples.

I only skimmed through one of their papers, but I'll try to lay out what I understand here.

There are basically two important insights in this paper. The first was to design a small programming language to represent string transformations. Even using full-on regular expressions created too many possibilities to search through quickly. They designed their own abstract language for manipulating strings; however, your constraints (e.g. only using simple quantifiers) would probably play the same role as their custom language. This is largely possible because your particular problem has a somewhat smaller scope than theirs.

The second insight was on how to actually find programs in this abstract language that match the given input/output pairs. My understanding is that the key idea here is to use a technique called version space algebra. The rough idea behind version space algebra is that you maintain a representation of the space of possible programs and repeatedly prune it by introducing additional constraints. The exact details of this process fall well outside my main interests, so you're better off reading something like this introduction to version space algebra, which includes some sample code as well.
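To make the pruning idea concrete, here is a minimal Java sketch. It is not the Flash Fill implementation: the class name `VersionSpaceSketch` and the fixed list `CANDIDATES` are hypothetical, and the "space" is simply an explicit list of restricted regexes, whereas real version space algebra keeps a compact representation instead of enumerating candidates. Each positive example removes every candidate that fails to match it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class VersionSpaceSketch {
    // Hypothetical hypothesis space: a handful of restricted regexes.
    static final List<String> CANDIDATES = List.of(
        "[A-Z]{3}", "[A-Z]+", "[a-z]+", "\\d{4}", "\\d+", "[A-Za-z]+");

    // Keep only the candidates that fully match every positive example.
    static List<String> prune(List<String> candidates, List<String> examples) {
        List<String> space = new ArrayList<>(candidates);
        for (String example : examples) {
            // Each example acts as one more constraint on the space.
            space.removeIf(regex -> !Pattern.matches(regex, example));
        }
        return space;
    }

    public static void main(String[] args) {
        // Prints [[A-Z]{3}, [A-Z]+, [A-Za-z]+]
        System.out.println(prune(CANDIDATES, List.of("ABC", "XYZ")));
    }
}
```

Several candidates usually survive, which is where a ranking step (for example, preferring the most specific expression) would come in.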

They also have some clever approaches to rank different candidate programs and even guess which inputs might be problematic for an already-generated program. I saw a demo where they generated a program without giving it enough input/output pairs, and the program could actually highlight new inputs that were likely to be incorrect. This sort of ranking is very interesting, but requires some more sophisticated machine learning techniques and is probably not immediately applicable to your use case. Might still be interesting though. (Also, this might have been detailed in a different paper than the one I linked.)

So yeah, long story short, you can generate your expressions by feeding input/output examples into a system based on version space algebra. I hope that helps.

+1 This certainly addresses the problem. (I am not tied to regex, but it is a succinct way of representing a solution space). Your link is probably overkill for what I want to do but looks like "a" "correct" way to do it. If there were an implementation I might use it, but it's too much to write a system from scratch. — peter.murray.rust, May 11 '13 at 22:36
@peter.murray.rust: Yeah, I'm not sure that they have an open-source implementation. The feature *did* make it into the new version of Excel though, so you could play around with it there. — Tikhon Jelvis, May 11 '13 at 22:38
agreed. But it's very reassuring to know that there are formal methods and they are useful. — peter.murray.rust, May 11 '13 at 22:41

score 2 · Answer 2 · edited May 23 '17 at 12:03

I'm currently researching the same thing, or something similar (here). In general, this is called grammar induction or, in the case of regular expressions, induction of regular languages. There is the StaMinA competition in this field. Common algorithms are RPNI and Blue-Fringe.

Here is another related question. And here another one. And here another one.

answered Feb 25 '14 at 16:02 by Albert · edited May 23 '17 at 12:03 by Community

peter.murray.rust · Answer 3 · 2013-05-13T10:05:33.003

My own approach (which I have partially prototyped) is heuristic and based on the premise that a given column will often have entries of the same or similar character length and with similar punctuation. I would welcome comments (the resulting code will be Open Source).

flatten [A-Z] to 'A'
flatten [a-z] to 'a'
flatten [0-9] to '0'
flatten any other special codepoint sets (e.g. greek characters) to a single character (e.g. alpha)
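The flattening rules above can be sketched in Java as follows. The class name `ColumnFlattener` is hypothetical, and this sketch only handles the three ASCII classes; the fourth rule (other codepoint sets such as Greek letters) is left out and such characters pass through unchanged.

```java
final class ColumnFlattener {
    // Map each character to its class symbol: 'A', 'a' or '0'.
    static String flatten(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                out.append('A');
            } else if (c >= 'a' && c <= 'z') {
                out.append('a');
            } else if (c >= '0' && c <= '9') {
                out.append('0');
            } else {
                // Punctuation, whitespace and any other codepoint are kept as-is.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input values are made up for illustration.
        System.out.println(flatten("XYZ"));    // AAA
        System.out.println(flatten("1200 m")); // 0000 a
    }
}
```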

The columns then become:

"AAA"
"Aaaaaaaaaa", "Aaaaaaaaaaaaa", "Aaa aaa Aaaaaa", etc.
"0000 a"
"00\u00b000'00"N,00\u00b000'00"E"
...
...
"0000"

I shall then replace these by regular expressions such as

"([A-Z])([A-Z])([A-Z])"
...
"(\d)(\d)(\d)(\d)\s([a-z])"

and capture the individual characters into sets. This will show that (say) in 3. the final char is always "m", so \d\d\d\d\s[m], and for 7. the value is [2][0][0][458].
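A minimal Java sketch of the per-position sets, assuming every value in the column has the same length (that is, the same flattened shape). The class name `PositionSets` is hypothetical, the sample years are made up, and characters that are special inside a character class (such as `]` or `\`) are not escaped, so this is only safe for letters, digits and spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class PositionSets {
    // Collect the characters seen at each position and emit one class per position.
    static String toCharClasses(List<String> values) {
        int length = values.get(0).length();
        List<TreeSet<Character>> sets = new ArrayList<>();
        for (int i = 0; i < length; i++) {
            sets.add(new TreeSet<>());
        }
        for (String v : values) {
            if (v.length() != length) {
                throw new IllegalArgumentException("length mismatch: " + v);
            }
            for (int i = 0; i < length; i++) {
                sets.get(i).add(v.charAt(i));
            }
        }
        StringBuilder regex = new StringBuilder();
        for (TreeSet<Character> set : sets) {
            regex.append('[');
            for (char c : set) {
                regex.append(c); // TreeSet iterates in sorted order
            }
            regex.append(']');
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        // Prints [2][0][0][458]
        System.out.println(toCharClasses(List.of("2008", "2004", "2005")));
    }
}
```

Positions whose set has a single member are effectively literals, so a later pass could collapse `[2][0][0]` to `200`.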

For the columns that don't fit this model we search using "(.*)" and see if we can create useful sets (cols 5. and 6.) with a heuristic such as "at least 2 multiple strings and no more than 50% unique strings".
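One possible reading of that heuristic, sketched in Java: treat a column as an enum if at least two distinct strings each occur more than once, and the number of distinct strings is at most half the number of rows. This reading, the class name `EnumColumn` and the thresholds as coded are assumptions for illustration, not a settled rule.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class EnumColumn {
    // Returns an alternation such as (hf|sr) if the column looks like an enum.
    static Optional<String> toAlternation(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        // Assumed reading: at least 2 strings occur more than once...
        long repeated = counts.values().stream().filter(n -> n >= 2).count();
        // ...and distinct strings make up no more than 50% of the rows.
        boolean enumLike = repeated >= 2 && counts.size() * 2 <= values.size();
        if (!enumLike) {
            return Optional.empty();
        }
        String body = counts.keySet().stream()
            // Quote anything that is not plain alphanumeric text.
            .map(v -> v.matches("[A-Za-z0-9]+") ? v : Pattern.quote(v))
            .collect(Collectors.joining("|"));
        return Optional.of("(" + body + ")");
    }

    public static void main(String[] args) {
        // Prints Optional[(hf|sr)]
        System.out.println(toAlternation(List.of("hf", "sr", "hf", "sr")));
    }
}
```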

By using dynamic programming (cf. Kruskal) I hope to be able to align similar regexes, which will be useful for me, at least!
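As a rough illustration of the dynamic programming step, here is a standard edit-distance (Levenshtein) computation in Java, applied to two flattened shapes. Whether this matches the alignment method intended by the Kruskal reference is an assumption; the class name `ShapeAlignment` is hypothetical, and the sketch returns only the distance, not the alignment itself.

```java
final class ShapeAlignment {
    // Classic edit distance: minimum number of insertions, deletions
    // and substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's first i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's first j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = d[i - 1][j - 1] + cost;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                d[i][j] = Math.min(substitution, Math.min(deletion, insertion));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("0000 a", "0000")); // 2
        System.out.println(editDistance("AAA", "AAA"));     // 0
    }
}
```

Shapes with a small distance are candidates for merging into one generalized expression; recovering the actual alignment would require a traceback through the table `d`.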
