I know this is a common question, but I haven't found an answer that applies to my case. I'm trying to remove the punctuation from a list of words, which I got by scraping an HTML page in an earlier function. Here is what I have:
import re

def strip_text():
    list_words = get_text().split()
    print(list_words)
    for i in range(len(list_words)):
        list_words = re.sub("[^a-zA-Z]"," ",list_words)
        list_words = list_words.lower()
    return list_words

print(get_text())
print(strip_text())
I realize that this doesn't work because re.sub expects a string, not a list. Is there an equally efficient way to do this? Should I turn the list of words back into a string?
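For reference, here is a minimal sketch of one way to apply re.sub to each word instead of to the whole list. The name `strip_words` and the sample sentence are made up for illustration. It also assumes that punctuation should be deleted rather than replaced with a space (the code above substitutes " ", which would leave spaces inside words), and that tokens with no letters at all should be dropped.

```python
import re

def strip_words(words):
    """Return the words lowercased, with every non-letter character removed."""
    cleaned = []
    for word in words:
        # Assumption: delete non-letters instead of replacing them with a space.
        letters_only = re.sub(r"[^a-zA-Z]", "", word).lower()
        if letters_only:  # skip tokens that were only punctuation or digits
            cleaned.append(letters_only)
    return cleaned

# Hypothetical sample input standing in for get_text().split()
sample = "Hello, world! It's 5 o'clock -- time for tea.".split()
print(strip_words(sample))
```

Running the substitution on the whole string before calling split() would probably work as well; the per-word version just keeps the list structure that the function above already has.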
Edit: as I said, the text comes from scraping an HTML page. The code that comes before what I have above looks like this:
from bs4 import BeautifulSoup
import requests
from collections import Counter
import re
tokens = []
types = Counter(tokens)
#str_book = ""
str_lines = ""
import string

def get_text():
    # str_lines = ""
    url = 'http://www.gutenberg.org/files/1155/1155-h/1155-h.htm'
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    text = soup.find_all('p') #finds all of the text between <p>
    i=0
    for p in text:
        i+=1
        line = p.get_text()
        if (i<10):
            continue
        print(line)
    return line
So the list of words would be a list of all the words in the Agatha Christie book that I'm using. Hopefully that helps.
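A note on get_text() as posted: `line` is reassigned on every pass through the loop, so `return line` can hand back at most one paragraph, not the whole book. Below is a rough sketch of how the paragraphs might be collected instead. The helper name `join_paragraphs` and the demo data are hypothetical; it assumes the first nine paragraphs are front matter to skip, mirroring the `i<10` check.

```python
from itertools import islice

def join_paragraphs(paragraphs, skip=9):
    """Join every paragraph after the first `skip` into a single string."""
    # islice works for lists and generators alike, so the paragraphs
    # can be produced lazily.
    return " ".join(islice(paragraphs, skip, None))

# Hypothetical stand-in for the paragraph texts scraped from the page
demo = ["front matter"] * 9 + ["Chapter one.", "It was a dark night."]
print(join_paragraphs(demo).split())
```

Inside get_text(), this would presumably be called as `join_paragraphs(p.get_text() for p in soup.find_all('p'))`, so that calling split() on the result yields the words of every remaining paragraph.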