2

I have a list of strings, and I want to remove specific characters from each string in it. Here is what I have so far:

s = [ "Four score and seven years ago, our fathers brought forth on",
      "this continent a new nation, conceived in liberty and dedicated"]

result = []
for item in s:
    words = item.split()
    for item in words:
        result.append(item)

print(result,'\n')

for item in result:
    g = item.find(',.:;')
    item.replace(item[g],'')
print(result)

The output is:

['Four', 'score', 'and', 'seven', 'years', 'ago,', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', 'a', 'new', 'nation,', 'conceived', 'in', 'liberty', 'and', 'dedicated']

In this case I want the new list to contain all the words, but without any punctuation marks except for quotes and apostrophes:

 ['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', 'a', 'new', 'nation', 'conceived', 'in', 'liberty', 'and', 'dedicated']

Even though I am using the find function, the result seems to be the same. How can I correct it so that it prints without the punctuation marks? How can I improve the code?

Leon Surrao

4 Answers

2

You can do this by using re.split with a regular expression to split on, in this case every character that is not a letter or digit.

import re
result = []
for item in s:
    # Split the current string (item), not the whole list s
    words = re.split("[^A-Za-z0-9]", item)
    result.extend(x for x in words if x) # Include nonempty elements
Krumelur
  • Testing this out, I get a bunch of repeats of my words if I use the `for` loop, but it behaves exactly the way we want if we omit the `for` loop entirely. – Engineero Jul 24 '15 at 20:45
  • Also you can add a `"+"` to the end of the regex (look for one or more matching characters per group) to get rid of inter-word spaces in your matched list. I still get a blank string at the end of my matched word list when I have a period at the end of my sentence though. Not sure why. – Engineero Jul 24 '15 at 20:54
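
As a rough sketch of the points raised in these comments, and assuming the list `s` from the question: `re.split` returns an empty string whenever the text ends with a separator, which is where the trailing blank comes from. One way around it is `re.findall`, which returns only the matched runs. The pattern below (letters, digits, apostrophes and double quotes) and the names `pattern` and `words` are illustrative choices, not part of the original answer.

```python
import re

s = ["Four score and seven years ago, our fathers brought forth on",
     "this continent a new nation, conceived in liberty and dedicated"]

# Match runs of letters, digits, apostrophes and double quotes;
# everything else (spaces, commas, periods, ...) acts as a separator.
pattern = re.compile(r"[A-Za-z0-9'\"]+")

words = []
for line in s:
    # findall returns only the matches, so no empty strings appear
    words.extend(pattern.findall(line))

print(words)

# re.split keeps an empty string when the text ends with a separator
trailing = re.split(r"[^A-Za-z0-9]+", "new nation.")
print(trailing)  # ['new', 'nation', '']
```

Because `findall` collects matches instead of splitting on separators, no filtering step for empty strings is needed.
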
2

You could strip all the characters that you want to get rid of after you split the string:

result = []
for item in s:
    words = item.split()
    for word in words:
        result.append(word.strip(",."))  # note the addition of .strip(...)

You can add whatever characters you want to get rid of to the string argument of .strip(), all in one string. The example above strips commas and periods from the start and end of each word.
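
A possible extension, sketched here under the assumption that every ASCII punctuation mark except the apostrophe and the double quote should go: build the argument for `.strip()` from `string.punctuation`. The names `to_strip` and `glued` are made up for this illustration. The last lines show that `.strip()` only touches the ends of a word, so punctuation in the middle survives.

```python
import string

s = ["Four score and seven years ago, our fathers brought forth on",
     "this continent a new nation, conceived in liberty and dedicated"]

# Assumption: keep apostrophes and double quotes, strip all other ASCII punctuation
to_strip = "".join(c for c in string.punctuation if c not in "'\"")

# Split on whitespace, then trim punctuation from both ends of each word
result = [word.strip(to_strip) for line in s for word in line.split()]
# Drop tokens that consisted only of punctuation
result = [word for word in result if word]

print(result)

# strip() leaves interior characters alone
glued = "nation,conceived".strip(to_strip)
print(glued)  # nation,conceived
```
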

Engineero
  • But beware of cases like "new nation,conceived", i.e. punctuation without whitespace. May or may not be a problem. – Krumelur Jul 24 '15 at 20:09
  • Excellent point. This was my Occam's razor approach. Regex would probably give you the most robust solution. – Engineero Jul 24 '15 at 20:11
1
s = [ "Four score and seven years ago, our fathers brought forth on", "this continent a new nation, conceived in liberty and dedicated"]

# Replace characters and split into words
# (the two-argument str.translate form is Python 2 only)
result = [x.translate(None, ',.:;').split() for x in s]

# Make a list of words instead of a list of lists of words (see http://stackoverflow.com/a/716761/1477364)
result = [inner for outer in result for inner in outer]

print result

Output:

['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', 'a', 'new', 'nation', 'conceived', 'in', 'liberty', 'and', 'dedicated']
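
The two-argument `translate(None, ...)` call is specific to Python 2. A minimal Python 3 sketch of the same idea, assuming the same list `s`, builds a deletion table with `str.maketrans` (the name `table` is illustrative):

```python
s = ["Four score and seven years ago, our fathers brought forth on",
     "this continent a new nation, conceived in liberty and dedicated"]

# With three arguments, str.maketrans maps every character in the
# third argument to None, which makes translate() delete it.
table = str.maketrans("", "", ",.:;")

# Delete the punctuation, split into words, and flatten in one pass
result = [word for line in s for word in line.translate(table).split()]

print(result)
```
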
Travis
1

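Or you could keep your second loop

for item in result:
    g = item.find(',.:;')
    item.replace(item[g],'')

and check each punctuation mark separately, because find(',.:;') searches for that exact four-character substring rather than for any one of the characters. Define a list of punctuation marks:

punc = [',','.',':',';']

Then iterate through it inside for item in result. Strings are immutable, so replace returns a new string that has to be kept:

for p in punc:
    item = item.replace(p, '')

So the full loop is:

punc = [',','.',':',';']
for i, item in enumerate(result):
    for p in punc:
        item = item.replace(p, '')  # replace returns a new string
    result[i] = item                # store the cleaned word back in the list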

I've tested this, it works.

Olivier Poulin