How to create a list that contains only the first instance of each word found in a string (excluding punctuations, newlines, etc.)

Question

Alright all you genius programmers and developers you... I could really use some help on this one, please.

I'm currently taking the 'Python for Everybody Specialization', that's offered through Coursera (https://www.coursera.org/specializations/python), and I'm stuck on an assignment.

I cannot figure out how to create a list that contains only the first instances of each word that's found in a string:

Example string:

my_string = ("How much wood would a woodchuck chuck, "
             "if a woodchuck would chuck wood?")

Desired list:

words_list = ['How', 'much', 'wood', 'would',
              'a', 'woodchuck', 'chuck', 'if']

Thank you all for your time, consideration, and contributions!

Remove punctuation and use a `set`? You'll likely get more answers if you post your own attempt first... – Fredrik Pihl, Aug 31 '17 at 21:25
As a start, see this https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python — Fredrik Pihl, Aug 31 '17 at 21:27
Why so many answers to a question that shows no effort from the OPs side? — Fredrik Pihl, Aug 31 '17 at 21:38
@FredrikPihl Because it's relatively simple and people seem to be bored. I agree that there are probably more "deserving" questions though. — nyrocron, Aug 31 '17 at 21:42
@FredrikPihl: I felt that if the contributors didn't have a syntactic constraint (i.e., some silly code snippet I slopped together in a sad attempt to appear less of a newbie coder, which I believe would only have caused confusion as to my desired output anyway), it would allow them the freedom to express how they would achieve the desired output. – ScriptNasty, Aug 31 '17 at 22:16

score 1 · Accepted Answer · answered Aug 31 '17 at 21:21

You can build a list of the words that have already been seen and filter out non-alphabetic characters:

my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"

final_l = []

for word in my_string.split():
    # Keep only the alphabetic characters, e.g. 'chuck,' -> 'chuck'
    word = ''.join(i for i in word if i.isalpha())
    if word not in final_l:
        final_l.append(word)

print(final_l)

Output:

['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
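
One edge case worth noting: a token made up only of punctuation (a stand-alone dash, for example) becomes an empty string after filtering and would be appended as if it were a word. A minimal sketch that skips such tokens, assuming a helper name `first_instances` that is hypothetical and not part of the answer:

```python
def first_instances(text):
    """Return each word once, in order of first appearance."""
    result = []
    for token in text.split():
        # Keep only alphabetic characters, e.g. 'chuck,' -> 'chuck'
        word = ''.join(ch for ch in token if ch.isalpha())
        # Skip pure-punctuation tokens (empty after filtering) and repeats
        if word and word not in result:
            result.append(word)
    return result


print(first_instances("How much wood - would a woodchuck chuck, "
                      "if a woodchuck would chuck wood?"))
```

The only change from the loop above is the extra `word and` check, so the stand-alone `-` does not produce an empty entry.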

Ajax1234

This is brilliant. I love how you used only the standard library to achieve the desired output! Thank you for sharing! – ScriptNasty Aug 31 '17 at 22:56
@ScriptNasty Glad to help! – Ajax1234 Aug 31 '17 at 23:01

score 1 · Answer 2 · answered Aug 31 '17 at 21:36

This can be accomplished in two steps: first remove the punctuation, then add the words to a set, which removes the duplicates.

Python 3:

from string import punctuation #  This is a string of all ascii punctuation characters

trans = str.maketrans('', '', punctuation)

text = 'How much wood would a woodchuck chuck, if a woodchuck would chuck wood?'.translate(trans)

words = set(text.split())

Python 2:

from string import punctuation #  This is a string of all ascii punctuation characters

text = 'How much wood would a woodchuck chuck, if a woodchuck would chuck wood?'.translate(None, punctuation)

words = set(text.split())

Sets are unordered, so this solution will not help if you need the words in order. — Dalvenjia, Aug 31 '17 at 21:38
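
If order matters, one possible variation on the Python 3 snippet is to replace `set` with `dict.fromkeys`. This is only a sketch and assumes Python 3.7 or later, where regular dictionaries are guaranteed to keep insertion order:

```python
from string import punctuation

my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"

# Strip ASCII punctuation, as in the answer above
cleaned = my_string.translate(str.maketrans('', '', punctuation))

# Dictionary keys are unique and (assumption: Python 3.7+) keep insertion order,
# so only the first instance of each word survives, in its original position
words_list = list(dict.fromkeys(cleaned.split()))

print(words_list)
```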

wencakisa · Answer 3 · 2017-08-31T21:44:10.623

You can use the re module and cast the result to a set in order to remove duplicates:

>>> import re

>>> my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
>>> words_list = re.findall(r'\w+', my_string)  # Find all words in your string (without punctuation)
>>> words_list_unique = sorted(set(words_list), key=words_list.index)  # Remove duplicates with a set, then sort by each word's first index to restore the original order.

>>> print(words_list_unique)
['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']

Explanation:

\w matches a single word character (a letter, digit, or underscore); \w+ matches a run of one or more of them, i.e. a word.
So you use re.findall(r'\w+', my_string) in order to find all the words in my_string.
A set is a collection with unique elements, so you cast your result list from re.findall() into a set.
Then you convert back to a list with sorted() in order to get a list with the unique words from your string.
EDIT - Because sets are unordered collections, you can pass key=words_list.index to sorted() in order to preserve the original order of the words.

`\w+` is equivalent to `[a-zA-Z0-9_]+`, which OP may not want to match. — Zach Gates, Aug 31 '17 at 22:46
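
To make the difference concrete, here is a small sketch with a made-up sample string (not from the question). Note that for `str` patterns in Python 3, `\w` also matches non-ASCII letters unless the `re.ASCII` flag is passed, so the `[a-zA-Z0-9_]` equivalence holds only in ASCII mode:

```python
import re

# Hypothetical sample containing an underscore, digits and a non-ASCII letter
sample = "snake_case, year2017 and caf\u00e9"

word_chars = re.findall(r'\w+', sample)            # Unicode-aware by default
ascii_only = re.findall(r'\w+', sample, re.ASCII)  # same as [a-zA-Z0-9_]+
letters_only = re.findall(r'[a-zA-Z]+', sample)    # ASCII letters only

print(word_chars)
print(ascii_only)
print(letters_only)
```

Which pattern is appropriate depends on what should count as a "word" for the assignment.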

score 0 · Answer 4 · answered Aug 31 '17 at 21:30

Since all instances of a word are identical, I'm going to take the question to mean that you want a unique list of words that appear in the string. Probably the easiest way to do this is:

import re
non_unique_words = re.findall(r'\w+', my_string)
unique_words = list(set(non_unique_words))

The re.findall call returns every run of word characters, and converting to a set and back to a list makes the results unique (but does not preserve their order).

sasquires

That seems to achieve the individual word slicing efficiently, but upon invoking print(unique_words), they're all unordered. I'd like to have all the words indexed within the list. – ScriptNasty Aug 31 '17 at 22:30
`\w+` is equivalent to `[a-zA-Z0-9_]+`, which OP may not want to match. – Zach Gates Aug 31 '17 at 22:45
@ZachGates Sure. OP needs to define "word" more clearly for this level of detail to matter. – sasquires Sep 01 '17 at 18:13
@ScriptNasty It's important to include such details in the original question. In that case you'll have to use another method to make the list of words unique. OrderedSet is one, but probably the simplest in standard Python is ZachGates's suggestion. The only problem I see with it is that it is $O(n^2)$ if the string has many words. – sasquires Sep 01 '17 at 18:16
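
For long inputs, the quadratic cost can be avoided by tracking the words already seen in a set, which has constant-time membership checks on average. A minimal sketch, where the function name `unique_in_order` is hypothetical and "word" is assumed to mean a run of ASCII letters:

```python
import re


def unique_in_order(text):
    """Return the first instance of each word, preserving order, in O(n)."""
    seen = set()    # fast membership checks
    ordered = []    # words in order of first appearance
    for word in re.findall(r'[a-zA-Z]+', text):
        if word not in seen:
            seen.add(word)
            ordered.append(word)
    return ordered


print(unique_in_order("How much wood would a woodchuck chuck, "
                      "if a woodchuck would chuck wood?"))
```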

score 0 · Answer 5 · answered Aug 31 '17 at 21:31

Try it:

my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
def replace(word, block):
    for i in block:
        word = word.replace(i, '')
    return word
my_string = replace(my_string, ',?')
result = list(set(my_string.split()))

amarynets

What about if I write a `'?!'` in the end of the sentence? – wencakisa Aug 31 '17 at 21:36
You can use the `punctuation` constant from the `string` module to delete all punctuation. – amarynets Aug 31 '17 at 21:43

nyrocron · Answer 6 · 2017-08-31T22:06:34.383

If you need to preserve the order the words appear in:

import string
from collections import OrderedDict

def unique_words(text):
    without_punctuation = text.translate({ord(c): None for c in string.punctuation})
    words_dict = OrderedDict((k, None) for k in without_punctuation.split())
    return list(words_dict.keys())

unique_words("How much wood would a woodchuck chuck, if a woodchuck would chuck wood?")
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']

I use OrderedDict because there does not appear to be an ordered set in the Python standard library.

Edit:

To make the word list case insensitive one could make the dictionary keys lowercase: (k.lower(), None) for k in ...
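
Lowercasing the keys also lowercases the returned words. If the original spelling of each first instance should be kept, one possible variation is to map the lowercase form to the word as first seen. This is a sketch with a hypothetical function name `unique_words_ci`, and it uses a plain dict on the assumption of Python 3.7+ insertion ordering:

```python
import string


def unique_words_ci(text):
    """Case-insensitive de-duplication that keeps the first spelling seen."""
    without_punctuation = text.translate({ord(c): None for c in string.punctuation})
    first_seen = {}
    for word in without_punctuation.split():
        # setdefault stores the word only if its lowercase key is new
        first_seen.setdefault(word.lower(), word)
    return list(first_seen.values())


print(unique_words_ci("The cat saw the Cat"))  # ['The', 'cat', 'saw']
```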

score 0 · Answer 7 · answered Sep 01 '17 at 00:01

It should be sufficient to find all of the words, and then filter out the duplicates.

import re

words = re.findall(r'[a-zA-Z]+', my_string)
words_list = [w for idx, w in enumerate(words) if w not in words[:idx]]

Zach Gates
