How to create a two-dimensional array of words of sentences from text in python?

Question

I have a text, let’s say with 5 sentences:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing. Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Using Python, how can I convert it to a two-dimensional array in which each sentence is split into separate words?

Taking the first sentence as an example, this is what I need as the first element of the array:

['lorem', 'ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry']

I can produce it with the following commands:

import re

string = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.'

string = string.lower()
arrWords = re.split('[^a-z]', string)
arrWords = list(filter(None, arrWords))  # drop the empty strings left by the split
print(arrWords)

But how can I build an array of such elements by looping through all the sentences in the text?

You need to split the text into sentences and then into words. How you decide where a sentence ends can be difficult. Have you looked at the NLTK package for python? — James, Feb 24 '17 at 04:06
[i.split(' ') for i in string.split('.')] will give the list of sentences that has list of words. Hope this helps! — Keerthana Prabhakaran, Feb 24 '17 at 04:28

score 3 · Answer 1 · answered Feb 24 '17 at 04:32

Although it is usually hard to tell exactly where a sentence ends, in this case a period marks the end of every sentence, so we can use that to split your paragraph into sentences. You already have working code to split a sentence into words, but a version is included here for completeness:

paragraph = "Lorem Ipsum ... "
sentences = []
while paragraph.find('.') != -1:
    index = paragraph.find('.')
    # keep the period; strip the space left over after the previous sentence
    sentences.append(paragraph[:index+1].strip())
    paragraph = paragraph[index+1:]

print(sentences)

Outputs:

['Lorem Ipsum is simply dummy text of the printing and typesetting industry.', 
"Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", 
'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.', 
'It was popularised in the 1960s with the release of Letraset sheets containing.', 
'Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.']

Then we convert them all to arrays of words:

word_matrix = []
for sentence in sentences:
    word_matrix.append(sentence.strip().split(' '))

print(word_matrix)

Which outputs:

[['Lorem', 'Ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry.'], 
['Lorem', 'Ipsum', 'has', 'been', 'the', "industry's", 'standard', 'dummy', 'text', 'ever', 'since', 'the', '1500s,', 'when', 'an', 'unknown', 'printer', 'took', 'a', 'galley', 'of', 'type', 'and', 'scrambled', 'it', 'to', 'make', 'a', 'type', 'specimen', 'book.'], 
['It', 'has', 'survived', 'not', 'only', 'five', 'centuries,', 'but', 'also', 'the', 'leap', 'into', 'electronic', 'typesetting,', 'remaining', 'essentially', 'unchanged.'], 
['It', 'was', 'popularised', 'in', 'the', '1960s', 'with', 'the', 'release', 'of', 'Letraset', 'sheets', 'containing.'], 
['Lorem', 'Ipsum', 'passages,', 'and', 'more', 'recently', 'with', 'desktop', 'publishing', 'software', 'like', 'Aldus', 'PageMaker', 'including', 'versions', 'of', 'Lorem', 'Ipsum.']]
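The words above still carry their capital letters and trailing punctuation ('industry.', 'centuries,'), while the question asks for lowercase words without it. One possible way to close that gap is sketched below. The helper name `normalize_words` is made up for this illustration, and the sketch assumes that punctuation only needs to be removed from the edges of a word, so that "industry's" keeps its apostrophe.

```python
import string

def normalize_words(sentence):
    """Lowercase a sentence and strip punctuation from the edges of each word."""
    words = []
    for raw in sentence.split():
        # str.strip with string.punctuation removes leading/trailing marks only
        word = raw.strip(string.punctuation).lower()
        if word:  # skip tokens that consisted of punctuation only
            words.append(word)
    return words

sentences = [
    'Lorem Ipsum is simply dummy text of the printing and typesetting industry.',
    'It has survived not only five centuries, but also the leap into electronic '
    'typesetting, remaining essentially unchanged.',
]
word_matrix = [normalize_words(s) for s in sentences]
print(word_matrix[0])
# ['lorem', 'ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the',
#  'printing', 'and', 'typesetting', 'industry']
```

Stripping only at the edges is a deliberate choice here: the regex in the question, `[^a-z]`, would also split "industry's" into two tokens and drop digits such as those in "1500s".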

Just a small moderation, in the example specification given by @roman_js , `If we take a first sentence as an example, here is what I need to be a first element of an array: ['lorem', 'ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry']` there is no period '.' at the end of the list. — Kaushik NP, Feb 24 '17 at 05:40

score 2 · Accepted Answer · answered Feb 24 '17 at 06:03

Remove the commas, split on '.', and then split each piece on whitespace (calling split() with no argument). Here s is the text:

paras = [p.split() for p in s.replace(',', '').split('.')]

This leaves you with one empty list at the end, which you could remove with a slice or by running the result through filter(None, ...). In Python 3, filter returns an iterator, so wrap it in list():

>>> list(filter(None, [p.split() for p in s.replace(',', '').split('.')]))
[['Lorem', 'Ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry'], ['Lorem', 'Ipsum', 'has', 'been', 'the', "industry's", 'standard', 'dummy', 'text', 'ever', 'since', 'the', '1500s', 'when', 'an', 'unknown', 'printer', 'took', 'a', 'galley', 'of', 'type', 'and', 'scrambled', 'it', 'to', 'make', 'a', 'type', 'specimen', 'book'], ['It', 'has', 'survived', 'not', 'only', 'five', 'centuries', 'but', 'also', 'the', 'leap', 'into', 'electronic', 'typesetting', 'remaining', 'essentially', 'unchanged'], ['It', 'was', 'popularised', 'in', 'the', '1960s', 'with', 'the', 'release', 'of', 'Letraset', 'sheets', 'containing'], ['Lorem', 'Ipsum', 'passages', 'and', 'more', 'recently', 'with', 'desktop', 'publishing', 'software', 'like', 'Aldus', 'PageMaker', 'including', 'versions', 'of', 'Lorem', 'Ipsum']]
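As an alternative to filtering afterwards, the empty trailing list can be avoided inside the comprehension itself. This is only a sketch of that variation, not part of the original answer; the sample string `s` reuses two sentences from the question.

```python
s = ("Lorem Ipsum is simply dummy text of the printing and typesetting industry. "
     "It has survived not only five centuries, but also the leap into electronic "
     "typesetting, remaining essentially unchanged.")

# Skip fragments that are empty or whitespace-only, so no trailing [] is produced.
paras = [p.split() for p in s.replace(',', '').split('.') if p.strip()]

print(len(paras))    # 2
print(paras[1][:3])  # ['It', 'has', 'survived']
```

The `if p.strip()` condition also drops empty fragments in the middle of the text, for example those produced by an ellipsis.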

score 1 · Answer 3 · answered Feb 24 '17 at 05:28

The challenge here is determining where a sentence ends. A regular expression would probably cover most cases, but the simple list comprehension below handles the dummy text, because every sentence in it ends with a period.

    x = "Lorem Ipsum is simply dummy ..."

    words = [sentence.split(" ") for sentence in x.split(". ")]
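For texts that do not end every sentence with a period, the regular-expression route mentioned above might look roughly like the following. It is a sketch under a stated assumption: a sentence ends at '.', '!' or '?' followed by whitespace. The function name `split_sentences` and the sample string are invented for the illustration, and abbreviations such as "e.g. " would still be split incorrectly.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when whitespace follows (a heuristic)."""
    # The lookbehind keeps the terminator attached to its sentence.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

x = "Pi is about 3.14. Is that exact? No!"  # made-up sample text
sentences = split_sentences(x)
words = [s.rstrip('.!?').split() for s in sentences]

print(sentences)  # ['Pi is about 3.14.', 'Is that exact?', 'No!']
print(words)      # [['Pi', 'is', 'about', '3.14'], ['Is', 'that', 'exact'], ['No']]
```

Requiring whitespace after the terminator is what keeps "3.14" in one piece, which a plain split on '.' would not do.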

score 1 · Answer 4 · edited May 23 '17 at 11:53

This assumes that each sentence ends with '.' (as in your example).

Setup:

para=input("Enter the Para : ")        #input : Paragraph
sentence=[]         #Store list of sentences
word=[]             #Store final list of 2D array

List of Sentences:

sentence=para.split('.')    #Split at '.' (periods)
sentence.pop()              #Last Element will be '' due to usage of split. So pop the last element

Get the list of words:

for i in range(len(sentence)):                      #Go through each sentence
    sentence[i]=sentence[i].strip(" ")              #Strip the whitespace (leading space at the start of a sentence)
    word.append(sentence[i].split(' '))             #Split into words and append the list to word

Print the result:

print(word)

INPUT :

Enter the Para :

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing. Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

OUTPUT :

[['Lorem', 'Ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry'], 
['Lorem', 'Ipsum', 'has', 'been', 'the', "industry's", 'standard', 'dummy', 'text', 'ever', 'since', 'the', '1500s,', 'when', 'an', 'unknown', 'printer', 'took', 'a', 'galley', 'of', 'type', 'and', 'scrambled', 'it', 'to', 'make', 'a', 'type', 'specimen', 'book'], 
['It', 'has', 'survived', 'not', 'only', 'five', 'centuries,', 'but', 'also', 'the', 'leap', 'into', 'electronic', 'typesetting,', 'remaining', 'essentially', 'unchanged'], 
['It', 'was', 'popularised', 'in', 'the', '1960s', 'with', 'the', 'release', 'of', 'Letraset', 'sheets', 'containing'], 
['Lorem', 'Ipsum', 'passages,', 'and', 'more', 'recently', 'with', 'desktop', 'publishing', 'software', 'like', 'Aldus', 'PageMaker', 'including', 'versions', 'of', 'Lorem', 'Ipsum']]

To split into sentences when characters other than the period '.' can end a sentence, you can use the re.split() function. For more information, see this link: Python: Split string with multiple delimiters
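A minimal sketch of the re.split() approach, reusing the variable names from the steps above. The set of delimiters ('.', '!' and '?') and the sample paragraph are assumptions made for the illustration; adjust the character class to whatever ends a sentence in your text.

```python
import re

para = "Is this dummy text? Yes it is! It has survived five centuries."  # made-up sample

# Split on any run of '.', '!' or '?'; drop fragments that are empty after stripping.
sentence = [s.strip() for s in re.split(r'[.!?]+', para) if s.strip()]
word = [s.split() for s in sentence]

print(word)
# [['Is', 'this', 'dummy', 'text'], ['Yes', 'it', 'is'],
#  ['It', 'has', 'survived', 'five', 'centuries']]
```

Because empty fragments are filtered out, no pop() is needed for the trailing '' that split leaves behind.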

Thank you for the solution and the link provided as there are other delimiters in my text. — roman_ds, Feb 24 '17 at 08:55
