I have a string and a reference list of elements. I want to split the string into a list of elements, taking the reference list into account: wherever an entry from the reference list occurs it should be kept as one element, and the rest of the sentence should be split into words. For example,

reference_list = ['10', '2 to 3', '1 and 1/2', '1/2', '1/22', ... ... etc]
my_list = "this happened at 10 o'clock and now after 2 to 3 hours has gone..meet 1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012... etc.

Output should look like,

out = ["this", "happened", "at", "10", "o'clock", .... "2 to 3", "hours", ... ... "1 and 1/2", "hours", ... "1/22", "or", "2/12/2012... ]

I would appreciate any help. Thank you in advance.

Update:

I have tried this,

   import re

   # `sentence` holds the input string (called my_list above).
   reg = r'\b(%s|\w+)\b' % '|'.join(reference_list)
   print(reg)
   result = []
   for e in re.finditer(reg, sentence):
       result.append(e.group())

   print(result)

Doesn't work.

Droid-Bird
  • Check: https://stackoverflow.com/questions/2136556/in-python-how-do-i-split-a-string-and-keep-the-separators – Kofi Dec 24 '21 at 02:52
  • When you give an example it's best to make it complete (i.e., no "...." or "etc.") and give the (complete) desired result, so that all readers can demonstrate how their suggested code works with the same example. – Cary Swoveland Dec 24 '21 at 06:09

2 Answers

1

This is similar to the split strings and keep separators problem.

You could concatenate all of your reference_list strings into one regex and use that.

Then for the resulting list, you can split the results that aren't in the reference_list by spaces.
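
A minimal Python sketch of that idea, offered as one possible reading of the steps above rather than a definitive implementation. The function name split_keeping_references and the sample data are made up for illustration. It assumes the reference strings should only match on word boundaries, and it tries longer entries first so that '1 and 1/2' wins over '1/2'.

```python
import re

def split_keeping_references(text, reference_list):
    """Split text into tokens, keeping reference strings whole."""
    # Longest entries first, so '1 and 1/2' is tried before '1/2'.
    alternatives = sorted(reference_list, key=len, reverse=True)
    # The capture group makes re.split keep the matched separators.
    pattern = r"\b(" + "|".join(re.escape(s) for s in alternatives) + r")\b"
    tokens = []
    for i, piece in enumerate(re.split(pattern, text)):
        if i % 2 == 1:
            tokens.append(piece)          # odd positions hold matched references
        else:
            tokens.extend(piece.split())  # everything else is split on whitespace
    return tokens

refs = ['10', '2 to 3', '1 and 1/2', '1/2', '1/22']
print(split_keeping_references("Visit us on 1/22 or 2/12/2012", refs))
# ['Visit', 'us', 'on', '1/22', 'or', '2/12/2012']
```

Because the non-reference pieces are split on whitespace only, punctuation stays attached to the neighbouring word (for example "later."), which may or may not be what is wanted.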

  • I have tried that approach but fails for cases like 1/2. I have updated the question. Please check? – Droid-Bird Dec 24 '21 at 03:32
  • It could've been the regex. It tries to capture either one of the reference strings or some whitespace. Depending on if it's greedy or not, it could just split everything by whitespace regardless. I'd first just try to get it to split by the reference strings first, then only after try to change your code to split by spaces. – Leif Messinger LOAF Dec 24 '21 at 11:09
0

Suppose we have the following data:

reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']
my_list = ("this happened at 10 o'clock and now after 2 to 3 " +
           "to 4 hours has gone we've decided to meet on-time " +
           "1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012")

(I have written the string this way so that it can be viewed without the need for horizontal scrolling.)

The key is to first sort reference_list into a list new_list such that if new_list[j] is a substring of new_list[i] then i < j (the converse does not hold in general). With Ruby this could be done as follows.

new_list = reference_list.sort { |a,b| a.include?(b) ? -1 : 1 }
  #=> ["1/22", "1 and 1/2", "1/2", "2 to 3 to 4", "10", "1",
  #    "2 to 3", "2"]

I assume Python code would be similar.
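
One way to get an ordering with the same property in Python is sketched below. Note that this swaps in a different technique from the Ruby comparator: it simply sorts by length, longest first, which works because a string that contains a different string is necessarily longer than it. The resulting order differs from the Ruby output above but satisfies the same condition.

```python
reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']

# Longest first: any entry that contains another entry comes before it.
new_list = sorted(reference_list, key=len, reverse=True)
print(new_list)
# ['2 to 3 to 4', '1 and 1/2', '2 to 3', '1/22', '1/2', '10', '2', '1']
```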

Next we programmatically construct a regular expression from new_list. Again, this could be done as follows in Ruby, and I assume the Python code would be similar:

/\b(?:#{new_list.join('|')}|[\w'-]+)\b/
  #=> /\b(?:1\/22|1 and 1\/2|1\/2|2 to 3 to 4|10|1|2 to 3|2|[\w'-]+)\b/

If this regular expression is used with re.findall we obtain the following result:

["this", "happened", "at", "10", "o'clock", "and", "now", "after",
 "2 to 3 to 4", "hours", "has", "gone", "we've", "decided", "to",
 "meet", "on-time", "1 and 1/2", "hours", "later", "Visit", "us",
 "on", "1/22", "or", "2", "12", "2012"]

Python regex demo
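
For reference, a Python sketch of the whole procedure might look like the following. It assumes the length-based ordering is an acceptable stand-in for the Ruby sort, and it uses re.escape so that any regex metacharacters in the reference strings are treated literally.

```python
import re

reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']
my_list = ("this happened at 10 o'clock and now after 2 to 3 "
           "to 4 hours has gone we've decided to meet on-time "
           "1 and 1/2 hours later. Visit us on 1/22 or 2/12/2012")

# Longest alternatives first, then the catch-all for ordinary words.
new_list = sorted(reference_list, key=len, reverse=True)
pattern = r"\b(?:" + "|".join(map(re.escape, new_list)) + r"|[\w'-]+)\b"

out = re.findall(pattern, my_list)
print(out)
```

This reproduces the list shown above, including the date "2/12/2012" being split into "2", "12" and "2012", since "/" is not part of the catch-all character class.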

Before any match has been made, and after each match has been made, findall attempts to match '1/22' at the current location in the string. If that fails it attempts to match '1 and 1/2', and so on. Lastly, if every alternative but the last fails, it attempts to match the catch-all [\w'-]+. I have arbitrarily included an apostrophe (so "o'clock" will be matched) and a hyphen (so "on-time" will be matched). Notice that all matches must be preceded and followed by a word boundary (\b).

Notice that while the text '2 to 3 to 4' could be matched by any of the alternatives 2 to 3 to 4, 2 to 3 and 2, the ordering of the elements of the alternation ensures that the first of these is the match that is made.
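
The effect of the ordering can be seen in a small Python comparison. This is only an illustrative sketch; the helper name tokenize is made up, and length-descending order is used as a stand-in for the sort described earlier.

```python
import re

def tokenize(text, alternatives):
    # Alternatives are tried left to right; the catch-all comes last.
    pattern = r"\b(?:" + "|".join(map(re.escape, alternatives)) + r"|[\w'-]+)\b"
    return re.findall(pattern, text)

reference_list = ['10', '2', '1', '2 to 3', '1/2', '1 and 1/2',
                  '1/22', '2 to 3 to 4']
text = "after 2 to 3 to 4 hours"

# Original order: '2' comes before '2 to 3 to 4', so the short one wins.
unsorted_result = tokenize(text, reference_list)
# Longest first: the full phrase is tried before its shorter substrings.
sorted_result = tokenize(text, sorted(reference_list, key=len, reverse=True))

print(unsorted_result)  # ['after', '2', 'to', '3', 'to', '4', 'hours']
print(sorted_result)    # ['after', '2 to 3 to 4', 'hours']
```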

Cary Swoveland