Suppose we have the following data:
reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
'1/22', '2 to 3 to 4']
my_list = "this happened at 10 o'clock and now after 2 to 3 " +
"to 4 hours has gone we've decided to meet on-time " +
"1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012"
(I have written the string this way so that it can be viewed without the need for horizontal scrolling.)
The key is to first sort reference_list to create a list new_list such that if new_list[j] is included in new_list[i] then i < j (though the opposite is generally not true). With Ruby this could be done as follows.
new_list = reference_list.sort { |a,b| a.include?(b) ? -1 : 1 }
#=> ["1/22", "1 and 1/2", "1/2", "2 to 3 to 4", "10", "1",
# "2 to 3", "2"]
I assume Python code would be similar.
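Here is a rough Python sketch of the ordering step. It does not port the Ruby comparator. Instead it assumes that sorting by length, longest first, is an acceptable substitute: a string can only properly contain a shorter string, so the longest-first order has the required property. The helper name containers_come_first is made up for this illustration.

```python
reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']

# Assumption: longest-first ordering stands in for the Ruby comparator.
# sorted() is stable, so entries of equal length keep their original order.
new_list = sorted(reference_list, key=len, reverse=True)


def containers_come_first(items):
    """Return True if no entry is a substring of an entry that comes after it."""
    return all(
        earlier not in later
        for k, earlier in enumerate(items)
        for later in items[k + 1:]
    )


print(new_list)
# ['2 to 3 to 4', '1 and 1/2', '2 to 3', '1/22', '1/2', '10', '2', '1']
print(containers_come_first(new_list))        # True
print(containers_come_first(reference_list))  # False
```

The order differs from the Ruby output above, but both satisfy the same property, which is all the next step relies on.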
Next we programmatically construct a regular expression from new_list. Again, this could be done as follows in Ruby, and I assume the Python code would be similar:
/\b(?:#{new_list.join('|')}|[\w'-]+)\b/
#=> /\b(?:1\/22|1 and 1\/2|1\/2|2 to 3 to 4|10|1|2 to 3|2|[\w'-]+)\b/
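A possible Python counterpart, as a sketch only. It starts from the sorted list shown in the Ruby output and passes each entry through re.escape. The escaping is an addition of mine: none of the current entries strictly need it, but it keeps the pattern safe if an entry containing a metacharacter is added later.

```python
import re

# The sorted list, copied from the Ruby output above.
new_list = ["1/22", "1 and 1/2", "1/2", "2 to 3 to 4", "10", "1",
            "2 to 3", "2"]

# Join the escaped entries into an alternation, then append the catch-all
# for ordinary words (letters, digits, underscore, apostrophe, hyphen).
alternatives = "|".join(re.escape(entry) for entry in new_list)
pattern = re.compile(r"\b(?:" + alternatives + r"|[\w'-]+)\b")

print(pattern.findall("1 and 1/2 hours"))
# ['1 and 1/2', 'hours']
```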
If this regular expression is used with re.findall we obtain the following result:
["this", "happened", "at", "10", "o'clock", "and", "now", "after",
"2 to 3 to 4", "hours", "has", "gone", "we've", "decided", "to",
"meet", "on-time", "1 and 1/2", "hours", "later", "Visit", "us",
"on", "1/22", "or", "2", "12", "2012"]
Python regex demo
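To reproduce the result above in Python, the following sketch writes the pattern as a literal (the Ruby regex without the escaped slashes) and applies it to the sample string. The variable names pattern and result are my own.

```python
import re

my_list = ("this happened at 10 o'clock and now after 2 to 3 "
           "to 4 hours has gone we've decided to meet on-time "
           "1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012")

# Same alternation order as the Ruby regex; "/" needs no escaping in Python.
pattern = r"\b(?:1/22|1 and 1/2|1/2|2 to 3 to 4|10|1|2 to 3|2|[\w'-]+)\b"

result = re.findall(pattern, my_list)
print(result)
```

Because the group is non-capturing, re.findall returns the full text of each match rather than group contents.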
Before any match has been made, and after each match has been made, findall attempts to match '1/22' at the current location in the string. If that fails it attempts to match '1 and 1\/2', and so on. Lastly, if every alternative but the last fails, it attempts to match the catch-all [\w'-]+. I have arbitrarily included an apostrophe (so "o'clock" will be matched) and a hyphen (so "on-time" will be matched). Notice that every match must be preceded and followed by a word boundary (\b).
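The effect of the word boundaries can be seen with a cut-down pattern. This is only an illustration: the input "12 or 1" and the reduced alternation are made up for the purpose.

```python
import re

text = "12 or 1"

# With \b, the alternative "1" cannot match the first digit of "12" because
# no word boundary follows it, so the engine falls back to the catch-all.
with_boundaries = re.findall(r"\b(?:1|[\w'-]+)\b", text)

# Without \b, "1" is accepted inside "12" and the number is split in two.
without_boundaries = re.findall(r"(?:1|[\w'-]+)", text)

print(with_boundaries)     # ['12', 'or', '1']
print(without_boundaries)  # ['1', '2', 'or', '1']
```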
Notice that while the text '2 to 3 to 4' could be matched by any of the alternatives 2 to 3 to 4, 2 to 3 and 2, the ordering of the elements of the alternation ensures that the first of these is the match that is made.
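A small experiment shows why the order matters. The two patterns below contain the same three alternatives plus the catch-all and differ only in their order. The sample text is a fragment of the original string.

```python
import re

text = "after 2 to 3 to 4 hours"

# Longer entries first: the whole range is consumed as a single match.
good_order = r"\b(?:2 to 3 to 4|2 to 3|2|[\w'-]+)\b"

# Shorter entries first: "2" wins immediately and the rest is split up.
bad_order = r"\b(?:2|2 to 3|2 to 3 to 4|[\w'-]+)\b"

good = re.findall(good_order, text)
bad = re.findall(bad_order, text)

print(good)  # ['after', '2 to 3 to 4', 'hours']
print(bad)   # ['after', '2', 'to', '3', 'to', '4', 'hours']
```

Alternation in this kind of regex engine takes the first alternative that leads to a successful match, not the longest one, which is why the list has to be sorted beforehand.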