1

I have a string that I want to split into a list of known types. For example, I want to split `Starter Main Course Dessert` into `[Starter, Main Course, Dessert]`.

I cannot use `split()` because it would break `Main Course` into two items. How can I do the splitting? Is a regex needed?

JJJ
  • You would have to know either the words or partial words, or the layout, in order to do this. – TheLazyScripter Feb 12 '17 at 17:25
  • What matches `Main Course` but not `Starter Main` or `Course Dessert` (from `Starter Main Course Dessert`)? This is impossible, AFAIK. –  Feb 12 '17 at 17:26
  • Yes I know the words that I want to split into, but I am not sure how to do it from the original string – JJJ Feb 12 '17 at 17:30
  • Maybe what you need requires 2-gram(bigram). In Python you can use `nltk`. [This](http://stackoverflow.com/questions/17531684/n-grams-in-python-four-five-six-grams) may be helpful. And [this](http://stackoverflow.com/questions/21844546/forming-bigrams-of-words-in-list-of-sentences-with-python) and [this](http://stackoverflow.com/questions/32441605/generating-ngrams-unigrams-bigrams-etc-from-a-large-corpus-of-txt-files-and-t) too. – Sangbok Lee Feb 12 '17 at 17:31
  • So you know all the words that you want to keep together, right? – Usmiech Feb 12 '17 at 17:48

2 Answers

3

If you have a list of acceptable words, you could use a regex union:

import re

acceptable_words = ['Starter', 'Main Course', 'Dessert', 'Coffee', 'Aperitif']
# Entries are joined as-is into one alternation, so they are treated as regex fragments.
pattern = re.compile("(" + "|".join(acceptable_words) + ")", re.IGNORECASE)
# "(Starter|Main Course|Dessert|Coffee|Aperitif)"

menu = "Starter Main Course NotInTheList dessert"
print(pattern.findall(menu))
# ['Starter', 'Main Course', 'dessert']
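
If the phrases may contain regex metacharacters, or if one phrase is a prefix of another, the union can be built a little more defensively. The following is only a sketch under those assumptions: `build_pattern` and the sample phrases are hypothetical names made up for illustration. It escapes each phrase with `re.escape` and sorts the longest phrases first, because alternation tries its branches from left to right.

```python
import re

def build_pattern(phrases):
    # Longest first, so "Main Course" is tried before a shorter entry such as "Main".
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape makes each phrase match literally, even if it contains ".", "(", "+", etc.
    union = "|".join(re.escape(p) for p in ordered)
    return re.compile(union, re.IGNORECASE)

pattern = build_pattern(["Main", "Main Course", "Starter", "Dessert", "Fish & Chips (2 pcs.)"])
print(pattern.findall("Starter Main Course dessert"))
# ['Starter', 'Main Course', 'dessert']
```

Note that escaping would also turn a deliberate fragment such as `\w+` into literal text, so this variant only suits lists of plain phrases.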

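If you only want to list the multi-word substrings that must stay together, put them first and let `\w+` catch every other word:

acceptable_words = ['Main Course', r'\w+']
# pattern.findall(menu) then gives ['Starter', 'Main Course', 'NotInTheList', 'dessert']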
Eric Duminil
0

I think it's more practical to specify only the 'special' two-word tokens.

special_words = ['Main Course', 'Something Special']
sentence = 'Starter Main Course Dessert Something Special Date'

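words = sentence.split(' ')
for i in range(len(words) - 1):
    try:
        # str() keeps the concatenation safe once an earlier merge has set words[i] to None.
        idx = special_words.index(str(words[i]) + ' ' + words[i+1])
        words[i] = special_words[idx]  # keep the merged token in the first slot
        words[i+1] = None              # mark the second word for removal
    except ValueError:
        # The pair is not a special token; leave both words alone.
        pass

# Drop the placeholders left behind by merged pairs.
words = [w for w in words if w is not None]
print(words)
# ['Starter', 'Main Course', 'Dessert', 'Something Special', 'Date']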
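
The same idea can be stretched to special tokens of more than two words. This is a rough sketch, not part of the approach above as written: `split_with_phrases` is a hypothetical helper name, and it assumes words are separated by whitespace. At each position it tries the longest known phrase first and falls back to a single word.

```python
def split_with_phrases(sentence, phrases):
    # Store each known phrase as a tuple of its words for quick lookup.
    known = {tuple(p.split()) for p in phrases}
    longest = max((len(k) for k in known), default=1)
    words = sentence.split()
    result = []
    i = 0
    while i < len(words):
        # Try the longest window first, then shrink down to 2 words.
        for size in range(min(longest, len(words) - i), 1, -1):
            if tuple(words[i:i + size]) in known:
                result.append(" ".join(words[i:i + size]))
                i += size
                break
        else:
            # No known phrase starts here, so keep the single word.
            result.append(words[i])
            i += 1
    return result

print(split_with_phrases('Starter Main Course Dessert Something Special Date',
                         ['Main Course', 'Something Special']))
# ['Starter', 'Main Course', 'Dessert', 'Something Special', 'Date']
```

Unlike the regex versions, this matching is case-sensitive and never drops words that are missing from the list.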
Sangbok Lee