removing specific items from a list of strings

Question

I have a list of strings and I want to remove specific characters from each string in it. Here is what I have so far:

s = [ "Four score and seven years ago, our fathers brought forth on",
      "this continent a new nation, conceived in liberty and dedicated"]

result = []
for item in s:
    words = item.split()
    for item in words:
        result.append(item)

print(result,'\n')

for item in result:
    g = item.find(',.:;')
    item.replace(item[g],'')
print(result)

The output is:

['Four', 'score', 'and', 'seven', 'years', 'ago,', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', 'a', 'new', 'nation,', 'conceived', 'in', 'liberty', 'and', 'dedicated']

In this case I wanted the new list to contain all the words, but it should not include any punctuation marks except for quotes and apostrophes.

 ['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', 'a', 'new', 'nation', 'conceived', 'in', 'liberty', 'and', 'dedicated']

Even though I am using the find function, the result seems to be the same. How can I make it print without the punctuation marks? How can I improve the code?

What is the exact output that you're expecting for the above list? — CristiFati, Jul 24 '15 at 19:52
@CristiFati the second output that he presents is his desired output, I believe. I've tested his code and both of his prints produce the first output. — Olivier Poulin, Jul 24 '15 at 20:00

score 2 · Answer 1 · answered Jul 24 '15 at 19:49

You can do this by using re.split to specify a regular expression to split on, in this case everything that is not a letter or digit.

import re
result = []
for item in s:
    words = re.split("[^A-Za-z0-9]", item)  # split the current string, not the whole list s
    result.extend(x for x in words if x) # Include nonempty elements

Krumelur


Testing this out, I get a bunch of repeats of my words if I use the `for` loop, but it behaves exactly the way we want if we omit the `for` loop entirely. – Engineero Jul 24 '15 at 20:45
Also you can add a `"+"` to the end of the regex (to match one or more characters per group) to get rid of inter-word spaces in your matched list. I still get a blank string at the end of my matched word list when I have a period at the end of my sentence though. Not sure why. – Engineero Jul 24 '15 at 20:54
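On the trailing blank string: `re.split` returns an empty string wherever the text starts or ends with a delimiter, so a sentence ending in a period produces one at the end. Below is a minimal sketch using a made-up sample sentence (`text` is not from the question). `re.findall` is a swapped-in alternative, not part of the answer above, that collects the words directly and so never yields empty strings:

```python
import re

# Hypothetical sample sentence that ends with a period
text = "a new nation, conceived in liberty."

# Split on runs of characters that are not letters or digits.
# The final "." is a delimiter with nothing after it, hence the trailing ''.
parts = re.split(r"[^A-Za-z0-9]+", text)

# Drop the empty strings, as the answer's generator expression does
words = [p for p in parts if p]

# Alternative: match the words instead of splitting on the gaps.
# This character class also keeps apostrophes and double quotes.
matched = re.findall(r"[A-Za-z0-9'\"]+", text)

print(parts)
print(words)
print(matched)
```

Whether to keep apostrophes and quotes inside the character class depends on the input; the question asks for them to be kept.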

score 2 · Accepted Answer · answered Jul 24 '15 at 19:52

You could strip all the characters that you want to get rid of after you split the string:

result = []
for item in s:
    words = item.split()
    for word in words:
        result.append(word.strip(",."))  # note the addition of .strip(...)

You can add whatever characters you want to get rid of to the string argument to .strip(), all in one string. The example above strips commas and periods from both ends of each word.
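To make the behaviour of `.strip()` concrete, here is a small sketch on a few made-up words (`samples` is illustrative, not from the question). It only trims the listed characters from the two ends of each string and leaves anything in the middle alone:

```python
# Hypothetical sample words chosen to show the edge cases
samples = ["ago,", "...wait,", "e.g.", "don't,"]

# Remove commas and periods from both ends of each word
stripped = [w.strip(",.") for w in samples]

print(stripped)  # ['ago', 'wait', 'e.g', "don't"]
```

The inner period of "e.g." survives, and so does the apostrophe in "don't", because it is not among the characters passed to `.strip()`.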

Engineero


But beware of cases like "new nation,conceived", i.e. punctuation without whitespace. May or may not be a problem. – Krumelur Jul 24 '15 at 20:09
Excellent point. This was my Occam's razor approach. Regex would probably give you the most robust solution. – Engineero Jul 24 '15 at 20:11
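The case raised in these comments can be sketched as follows. The sample `line` is made up for illustration, and replacing punctuation with a space via `re.sub` is one possible workaround, not something either answer prescribes:

```python
import re

# Hypothetical line with punctuation that is not followed by whitespace
line = "this continent a new nation,conceived in liberty"

# Split on whitespace, then strip: the comma inside the token is untouched
stripped = [w.strip(",.:;") for w in line.split()]

# Replace each punctuation mark with a space first, then split
separated = re.sub(r"[,.:;]", " ", line).split()

print(stripped)
print(separated)
```

The first list still contains 'nation,conceived' as a single item, while the second has 'nation' and 'conceived' as separate words.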

score 1 · Answer 3 · answered Jul 24 '15 at 19:58

s = [ "Four score and seven years ago, our fathers brought forth on", "this continent a new nation, conceived in liberty and dedicated"]

# Replace characters and split into words
# (Python 2 only: str.translate(None, chars) deletes the given chars)
result = [x.translate(None, ',.:;').split() for x in s]

# Make a list of words instead of a list of lists of words (see http://stackoverflow.com/a/716761/1477364)
result = [inner for outer in result for inner in outer] 

print result

Output:

['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', 'a', 'new', 'nation', 'conceived', 'in', 'liberty', 'and', 'dedicated']
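The `translate(None, ...)` call above is Python 2 syntax. Under Python 3, `str.translate` takes a single table argument, which `str.maketrans` can build. A minimal sketch, assuming you want to delete all ASCII punctuation except quotes and apostrophes (the names `to_delete` and `table` are illustrative):

```python
import string

s = ["Four score and seven years ago, our fathers brought forth on",
     "this continent a new nation, conceived in liberty and dedicated"]

# Every ASCII punctuation character except the apostrophe and double quote
to_delete = "".join(ch for ch in string.punctuation if ch not in "'\"")

# Third argument of maketrans: characters to delete
table = str.maketrans("", "", to_delete)

# Translate each line, split it, and flatten, all in one comprehension
result = [word for line in s for word in line.translate(table).split()]

print(result)
```

Like the original, this deletes punctuation rather than replacing it with a space, so "nation,conceived" would become "nationconceived".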

score 1 · Answer 4 · answered Jul 24 '15 at 19:58

Or you could just add a loop inside

for item in result:
    g = item.find(',.:;')
    item.replace(item[g],'')

and split up ',.:;' into a list of punctuation marks, like

punc = [',','.',':',';']

then iterate through it inside for item in result:, keeping the string that replace returns, like

for p in punc:
    item = item.replace(p, '')

so the full loop is

punc = [',','.',':',';']
for i, item in enumerate(result):
    for p in punc:
        item = item.replace(p, '')  # replace returns a new string
    result[i] = item  # store the cleaned word back in the list

I've tested this, it works.
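For reference, a small sketch of why the loop in the question prints the same list twice. The variable names here (`word`, `index`, `last_char`, `cleaned`) are illustrative only:

```python
word = "ago,"

# find() searches for the whole substring ',.:;', which never occurs,
# so it returns -1
index = word.find(",.:;")

# word[-1] is then simply the last character of the word
last_char = word[index]

# Strings are immutable: replace() builds a new string and leaves
# the original untouched unless the result is assigned somewhere
cleaned = word.replace(",", "")

print(index, repr(last_char), repr(word), repr(cleaned))
```

So any fix has to both search for one punctuation mark at a time and store the value that `replace` returns.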
