4

I am trying to split the sentences into words.

words = content.lower().split()

This gives me a list of words like:

'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'

and with this code:

def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    return clean_word_list

I get something like:

'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'

If you look at the word "morningthe" in the list, it used to have "--" between the words. Now, is there any way I can split it into two words, like "morning", "the"?

Moinuddin Quadri
Yun Tae Hwang
  • You need to split on all separators, not just white-space. This is covered in other StackOverflow questions. – Prune Jan 27 '17 at 22:01
  • possible duplicate of http://stackoverflow.com/q/13209288/3865495 – CoconutBandit Jan 27 '17 at 22:01
  • You need to use the `strip()` method to delete unwanted symbols at the ends of the line, i.e. `'x-'.strip(',:-')` -> `'x'`, but `'x-y'.strip(',:-')` -> `'x-y'`. However, if you want to work with real texts, you need a more complex approach... Maybe NLTK would be a good start? – myaut Jan 27 '17 at 22:02
  • Use `nltk.word_tokenize(content)` or `re.findall(r'\w+',content)`. – DYZ Jan 27 '17 at 22:02
  • Possible duplicate of [How to use the regex module in python to split a string of text into the words only?](http://stackoverflow.com/questions/25496670/how-to-use-the-regex-module-in-python-to-split-a-string-of-text-into-the-words-o) – DYZ Jan 27 '17 at 22:04

5 Answers

4

I would suggest a regex-based solution:

import re

def to_words(text):
    return re.findall(r'\w+', text)

This looks for all words, i.e. groups of word characters (letters, digits, and the underscore), ignoring punctuation, separators and whitespace.

>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']

Note that if you're looping over the words, using re.finditer, which returns an iterator of match objects, is probably better, as you don't have to store the whole list of words at once.
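
For completeness, a minimal sketch of the finditer variant (the helper name iter_words is just for illustration):

import re

def iter_words(text):
    # Yield one word at a time instead of building the full list.
    for match in re.finditer(r'\w+', text):
        yield match.group()

>>> list(iter_words("The morning-the evening"))
['The', 'morning', 'the', 'evening']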

FlipTack
3

Alternatively, you may also use itertools.groupby along with str.isalpha to extract the alphabetic-only words from the string:

>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'

>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
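
A quick illustration of why this works: groupby splits the sentence into runs keyed by whether each character is alphabetic, and the comprehension keeps only the runs whose key is True.

>>> [(key, ''.join(group)) for key, group in groupby('ab--cd', str.isalpha)]
[(True, 'ab'), (False, '--'), (True, 'cd')]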

PS: The regex-based solution is much cleaner. I mention this only as a possible alternative.


Specific to OP: If all you want is to also split on -- in the resulting list, then you may first replace the hyphens '-' with a space ' ' before splitting. Hence, your code should be:

words = content.lower().replace('-', ' ').split()

where words will hold the value you desire.
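
As a quick check, using the sample sentence from the question (the leftover punctuation is then removed by clean_up_list):

>>> content = 'evening, and there was morning--the first day.'
>>> content.lower().replace('-', ' ').split()
['evening,', 'and', 'there', 'was', 'morning', 'the', 'first', 'day.']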

Moinuddin Quadri
1

Trying to do this with regexes will drive you crazy, e.g.:

>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']

Definitely look at the nltk package.
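
As a rough sketch, assuming nltk is installed and the punkt tokenizer data has been downloaded, word_tokenize handles the contractions sensibly:

import nltk

# nltk.download('punkt')  # one-time download of the tokenizer data
print(nltk.word_tokenize("Don't read O'Rourke's books!"))
# Typically yields something like:
# ['Do', "n't", 'read', "O'Rourke", "'s", 'books', '!']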

John Machin
0

Besides the solutions already given, you could also improve your clean_up_list function to do a better job.

def clean_up_list(word_list):
    clean_word_list = []
    # Keep the symbols string outside the loop so it isn't
    # rebuilt on every iteration.
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"

    for word in word_list:
        current_word = ''
        for index in range(len(word)):
            if word[index] in symbols:
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += word[index]

        if current_word:
            # Append possible last current_word
            clean_word_list.append(current_word)

    return clean_word_list

Actually, you could apply the body of the for word in word_list: loop to the whole sentence and get the same result, as shown in the sketch below.
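
Here is a minimal sketch of that idea, using a hypothetical clean_up_sentence helper that also treats whitespace as a separator (the symbols string is the one from the question, with the backslash escaped explicitly):

def clean_up_sentence(sentence):
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"
    clean_word_list = []
    current_word = ''
    for char in sentence:
        if char in symbols or char.isspace():
            # A symbol or a space ends the current word, if any.
            if current_word:
                clean_word_list.append(current_word)
                current_word = ''
        else:
            current_word += char
    if current_word:
        # Append a possible trailing word.
        clean_word_list.append(current_word)
    return clean_word_list

>>> clean_up_sentence('evening, and there was morning--the first day.')
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']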

kagami
0

You could also do this:

import re

def word_list(text):
    return list(filter(None, re.split(r'\W+', text)))

print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))

Returns:

['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']