I am working in Python 3.6 with NLTK 3.2.
I am trying to write a program which takes raw text as input and outputs any (maximum) series of consecutive words beginning with the same letter (i.e. alliterative sequences).
When searching for sequences, I want to ignore certain words and punctuation (for instance, 'it', 'that', 'into', ''s', ',', and '.'), but to include them in the output.
For example, inputting
"The door was ajar. So it seems that Sam snuck into Sally's subaru."
should yield
["so", "it", "seems", "that", "sam", "snuck", "into", "sally's", "subaru"]
I am new to programming and the best I could come up with is:
import nltk
from nltk import word_tokenize
raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."
tokened_text = word_tokenize(raw) #word tokenize the raw text with NLTK's word_tokenize() function
tokened_text = [w.lower() for w in tokened_text] #make it lowercase
for w in tokened_text: #for each word of the text
letter = w[0] #consider its first letter
allit_str = []
allit_str.append(w) #add that word to a list
pos = tokened_text.index(w) #let "pos" be the position of the word being considered
for i in range(1,len(tokened_text)-pos): #consider the next word
if tokened_text[pos+i] in {"the","a","an","that","in","on","into","it",".",",","'s"}: #if it's one of these
allit_str.append(tokened_text[pos+i]) #add it to the list
i=+1 #and move on to the next word
elif tokened_text[pos+i][0] == letter: #or else, if the first letter is the same
allit_str.append(tokened_text[pos+i]) #add the word to the list
i=+1 #and move on to the next word
else: #or else, if the letter is different
break #break the for loop
if len(allit_str)>=2: #if the list has two or more members
print(allit_str) #print it
which outputs
['ajar', '.']
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['snuck', 'into', 'sally', "'s", 'subaru', '.']
['sally', "'s", 'subaru', '.']
['subaru', '.']
This is close to what I want, except that I don't know how to restrict the program to only print the maximum sequences.
So my questions are:
- How can I modify this code to only print the maximum sequence
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
? - Is there an easier way to do this in Python, maybe with regular expression or more elegant code?
Here are similar questions asked elsewhere, but which have not helped me modify my code:
- How do you effectively use regular expressions to find alliterative expressions?
- A reddit challenge asking for a similar program
- 4chan question regarding counting instances of alliteration
- Blog about finding most common alliterative strings in a corpus
(I also think it would be nice to have this question answered on this site.)