
I know this is a common question but I haven't found an applicable answer. I'm trying to remove the punctuation from a list of words, which I have gotten from scraping an HTML page in an earlier function. Here is what I have:

import re
def strip_text():    
        list_words = get_text().split()
        print(list_words)
        for i in range(len(list_words)):
            list_words = re.sub("[^a-zA-Z]"," ",list_words)
            list_words = list_words.lower()
        return list_words
    print(get_text()) 
    print(strip_text())

I realize that this doesn't work because the re.sub bit is supposed to be used on a string, not a list. Is there an equally efficient way to do this? Should I make the list of words a string again?

edit: this problem is scraping the text from an HTML page, like I said. The code before what I have above looks like this:

from bs4 import BeautifulSoup
import requests
from collections import Counter
import re
tokens = []
types= Counter(tokens)
#str_book = ""
str_lines = ""
import string

def get_text(): 
   # str_lines = ""
    url = 'http://www.gutenberg.org/files/1155/1155-h/1155-h.htm'
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    text = soup.find_all('p') #finds all of the text between <p>
    i=0
    for p in text:
        i+=1
        line = p.get_text()
        if (i<10):
            continue
        print(line)
    return line

So the list of words would be a list of all the words in the Agatha Christie book that I'm using. Hopefully that helps.

Alanan
  • This doesn't answer your question directly but I wanted to point out the Beautiful Soup package handles a lot of activities related to web scraping - so if you're currently writing your own functions, might be worth looking into – HFBrowning Dec 01 '16 at 17:03
  • Thanks - yeah, I use BeautifulSoup in my get_text function! Definitely makes that part a ton easier. – Alanan Dec 01 '16 at 17:05
  • `import string; list_words = [s.translate(None, string.punctuation) for s in list_words]`, using list comprehension with [this](http://stackoverflow.com/a/266162/6779606) answer. – Stephen B Dec 01 '16 at 17:12
  • According to my system translate should only take 1 argument - that answer is 8 years old so maybe stuff has changed? – Alanan Dec 01 '16 at 17:21
  • @Alanan Do you have punctuation marks inside your words or just at the beginning / end of each word? – ettanany Dec 01 '16 at 17:24
  • @Alanan, I am using Python 2.7. I see it has changed for Python 3.x, as per [this answer](http://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate). So unfortunately using `translate` doesn't appear to be a simple one-liner anymore. Now you would need to create a dictionary mapping for every punctuation character and use that dictionary as the argument in `translate`, and I'm not sure if it's faster than just regex now. – Stephen B Dec 01 '16 at 17:30
  • @Alanan, your `get_text()` function isn't actually correct, and you should not get a *list of words* from this function as you say. – Ahsanul Haque Dec 01 '16 at 18:10
  • @AhsanulHaque but when I split get_text(), which should return a string, using list_words = get_text().split(), will that not return a list? (Sorry if this is basic stuff, I'm still very new to Python) – Alanan Dec 01 '16 at 19:23
  • @Alanan, let me explain a bit. In your `text` variable you have a list of `<p>` elements. Then you iterate over them and get only the *text* of each *p* in the `line` variable. `line` contains a sentence-like thing (a few words). Up to this point everything is okay. However, you are not storing the value of `line` in each iteration; instead you return only the value from the last iteration, which is wrong. Also, I don't understand why you are using `if i<10:continue`. Maybe you want to fetch every tenth line. I don't know. – Ahsanul Haque Dec 01 '16 at 19:34
  • @AhsanulHaque I'm using if i<10:continue because I wanted to get rid of the first ten lines of the text (it was the Table of Contents, and I just wanted the novel's contents). I'm trying to store the value of the line variable in each iteration but I'm unsure of how, as you say, I'm only returning the value of the last iteration. Is that an indentation problem? – Alanan Dec 01 '16 at 20:00
  • @Alanan, I could explain it, but Pynoob has already added a separate answer, which is nice and should satisfy your query. Best of luck. – Ahsanul Haque Dec 01 '16 at 20:09

2 Answers


You don't need regex at all. string.punctuation contains all of the ASCII punctuation characters. Just iterate over the characters of each word and skip those.

>>> import string
>>> ["".join( j for j in i if j not in string.punctuation) for i in  lst]
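
As a minimal, self-contained sketch of the comprehension above: the sample list here is hypothetical and stands in for `lst` (that is, `list_words` in the question), and the variable names are only illustrative.

```python
import string

# Hypothetical sample input; in the question this would be list_words.
lst = ["Hello,", "world!", "it's", "(nearly)", "done."]

# For each word, keep only the characters that are not ASCII punctuation.
cleaned = ["".join(ch for ch in word if ch not in string.punctuation)
           for word in lst]

print(cleaned)  # ['Hello', 'world', 'its', 'nearly', 'done']
```

Note that the comprehension builds a new list, so the result has to be assigned to a name (or returned); the original list is left unchanged.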
Ahsanul Haque
  • I tried this in the place of my for loop (fixed a little so what you had as lst was list_words, etc), but it still returns the text to me with all its punctuation. Hmm. – Alanan Dec 01 '16 at 17:39
  • @Alanan, does it just appear that way because of the print statement in the loop of `get_text()`? It appears the last line and thus the line returned by `get_text()` is blank, causing `strip_text()` to do nothing and return an empty list. – Stephen B Dec 01 '16 at 17:50
  • @PyNoob do you mean the print statement print(line) under continue? When I comment that out, it won't return my text at all. – Alanan Dec 01 '16 at 18:53
  • @Alanan sort of. I mean your `get_text()` function doesn't actually return all of the lines of the book, but only the last line, which happens to be blank. So the `return` value (the value in `list_words` in `strip_text()`) is an empty list. Thus, your `print` statement in `get_text()` gives your code the appearance of working but never removing any punctuation, because `strip_text()` has nothing to remove punctuation from and returns that same empty list. – Stephen B Dec 01 '16 at 19:28

Taking a look at get_text(), it appears we need to modify a few things before we can remove any punctuation. I've added some comments in here.

def get_text(): 
    str_lines = []  # create an empty list
    url = 'http://www.gutenberg.org/files/1155/1155-h/1155-h.htm'
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    text = soup.find_all('p') #finds all of the text between <p>
    i=0
    for p in text:
        i+=1
        line = p.get_text()
        if (i<10):
            continue
        str_lines.append(line)  # append the current line to the list
    return str_lines  # return the list of lines

First, I uncommented your str_lines variable and set it to an empty list. Next, I replaced the print statement with code to append the line to the list of lines. Finally, I changed the return statement to return that list of lines.

For strip_text(), we can reduce it to a few lines of code:

def strip_text():    
    list_words = get_text()
    list_words = [re.sub("[^a-zA-Z]", " ", s.lower()) for s in list_words]
    return list_words

There is no need to operate on a per-word basis because we can look at the entire line and remove all punctuation, so I removed the split(). Using a list comprehension, we can alter every element of the list in a single line, and I also put the lower() method in there to condense the code. Note that this regex replaces every non-letter character with a space rather than deleting it.

To implement the answer provided by @AhsanulHaque, you just need to substitute that second line of the strip_text() method with it, as shown:

def strip_text():
    list_words = get_text()
    list_words = ["".join(j.lower() for j in i if j not in string.punctuation)
                  for i in list_words]
    return list_words

For fun, here is that translate method I mentioned earlier implemented for Python 3.x, as described here:

def strip_text():
    list_words = get_text()
    translator = str.maketrans({key: None for key in string.punctuation})
    list_words = [s.lower().translate(translator) for s in list_words]
    return list_words

Unfortunately I cannot time any of these for your particular code because Gutenberg blocked me temporarily (too many runs of the code too quickly, I suppose).
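
One way to compare the three approaches without hitting Gutenberg at all is to time them on local data. The following is only a rough sketch: `sample_lines` is made-up text, the function names are hypothetical, and the results on the real book may well differ.

```python
import re
import string
import timeit

# Made-up sample lines standing in for the scraped text (no network needed).
sample_lines = ['"Well," said Tommy, "what now?"',
                "Nothing -- at least, not yet!"] * 500

# Build the translation table and the compiled pattern once, outside the loops.
translator = str.maketrans("", "", string.punctuation)
non_letters = re.compile("[^a-zA-Z]")

def with_regex(lines):
    # Replaces every non-letter (including digits and spaces) with a space.
    return [non_letters.sub(" ", s.lower()) for s in lines]

def with_join(lines):
    # Drops ASCII punctuation character by character.
    return ["".join(c for c in s.lower() if c not in string.punctuation)
            for s in lines]

def with_translate(lines):
    # Drops ASCII punctuation using a precomputed translation table.
    return [s.lower().translate(translator) for s in lines]

for func in (with_regex, with_join, with_translate):
    seconds = timeit.timeit(lambda: func(sample_lines), number=20)
    print(f"{func.__name__}: {seconds:.3f}s")
```

The regex version is not strictly equivalent to the other two, since it substitutes spaces instead of deleting characters, so the timings compare approaches rather than identical outputs.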

Stephen B
  • Wow - this was so incredibly helpful! For some reason \r's, \n's, commas, apostrophes, and quotation marks are still in the text, but I might just brute-force those out. If you have any more tips, they are welcome, but otherwise, thanks so much for your time/effort and your lengthy explanations - really helped. – Alanan Dec 01 '16 at 20:11
  • btw, `string.punctuation` includes `'!"#$%&\'()*+,-./:;<=>?@[\\]^_\`{|}~'`. All these characters should be ignored. – Ahsanul Haque Dec 01 '16 at 20:17
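
A possible explanation for the leftover characters, offered as an assumption since the thread does not confirm it: `string.punctuation` only covers ASCII, so typographic (curly) quotes and apostrophes pass straight through, and `\r` and `\n` are whitespace rather than punctuation. A hedged sketch that handles both, using a hypothetical helper name and a made-up sample string:

```python
import string
import unicodedata

def strip_all_punctuation(line):
    # Drop ASCII punctuation plus every character whose Unicode category
    # starts with "P" (punctuation), which includes curly quotes and dashes.
    no_punct = "".join(
        ch for ch in line
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith("P")
    )
    # str.split() with no arguments splits on any run of whitespace,
    # so \r and \n collapse into single spaces when re-joined.
    return " ".join(no_punct.lower().split())

# Made-up sample containing curly quotes, a curly apostrophe and \r\n.
print(strip_all_punctuation("\u201cIt\u2019s late,\u201d she said.\r\nVery late!"))
# its late she said very late
```

If the `\r` and `\n` only show up when printing the list itself, they may simply be part of the list's repr; printing each string individually would show actual line breaks instead.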