I've got the Python BeautifulSoup script below (adapted to Python 3 from that script). It runs without errors, but nothing is printed in cmd.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a url, get page content
    data = urlopen(url).read()
    # parse as html structured document
    soup = BeautifulSoup(data, 'html.parser')
    # kill javascript content
    for s in soup.findAll('script'):
        s.replaceWith('')
    # find body and extract text
    txt = soup.find('body').getText('\n')
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
    ]
    txt = [getPageText(url) for url in urls]

if __name__=="__main__":
    main()
Here's my cmd output:
Microsoft Windows [Version 10.0..]
(c) Microsoft Corporation. All rights reserved.
C:\Users\user\Desktop\urls>python urls.py
C:\Users\user\Desktop\urls>
Why doesn't it print the pages' contents?
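
For reference, one likely explanation is that main() collects the text into the list txt but never prints it, so the script finishes silently. The sketch below illustrates that idea only. get_page_text is a hypothetical offline stand-in for getPageText (it fetches nothing), and the example.com URLs are placeholders, so treat this as an illustration under those assumptions rather than a confirmed fix.

```python
def get_page_text(url):
    # Hypothetical stand-in for getPageText() so the sketch runs offline;
    # the real function would download and parse the page.
    return "text extracted from " + url


def main(urls):
    # Building the list only stores the results in memory.
    texts = [get_page_text(url) for url in urls]
    # Nothing reaches the console until the results are printed explicitly.
    for url, text in zip(urls, texts):
        print(url)
        print(text)
    return texts


if __name__ == "__main__":
    # Placeholder URLs, not the ones from the question.
    main(["http://example.com/a", "http://example.com/b"])
```

If that is the cause, adding a similar print loop over txt at the end of the original main() should make the extracted text appear in cmd.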