Edit: the solution from "BeautifulSoup Grab Visible Webpage Text" is working perfectly fine.
I was using this solution from elsewhere on Stack Overflow (BeautifulSoup Grab Visible Webpage Text) to get text out of webpages with Beautiful Soup:
import requests
from bs4 import BeautifulSoup
# suppress the InsecureRequestWarning that verify=False would otherwise trigger
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
# settings
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "http://imfuna.com"
response = requests.get(url, headers=headers, verify=False)
soup = BeautifulSoup(response.text, "lxml")
# drop script and style elements so their contents are not treated as text
for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
# split on double spaces so each phrase within a line becomes its own chunk
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
# split on any whitespace, because the chunks are joined with newlines
front_text_count = len(text.split())
print(front_text_count)
print(text)
For most sites it works nicely, but for the example URL above (imfuna.com) it only retrieves 6 words, even though the page contains many more (e.g. "Digital inspections for the residential or commercial property surveyor").
The words missing from the output sit inside ordinary p/h1 tags in the page source, so I can't understand why the code doesn't pick them up.
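One way to narrow this down, as a rough sketch: check whether a missing phrase is present in the raw HTML but absent from the extracted text. If it is, the parsing step is the more likely culprit than the request itself. The sketch below deliberately uses only the standard library's html.parser rather than lxml, so the two results can be compared. The names VisibleTextParser, visible_text and phrase_survives, and the sample markup, are made up for illustration and are not part of the code above.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside skipped elements,
        # collapsing internal runs of whitespace to single spaces.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def visible_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def phrase_survives(html, phrase):
    """Return (found_in_raw_html, found_in_extracted_text)."""
    return phrase in html, phrase in visible_text(html)


# Hypothetical sample markup standing in for a downloaded page.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Digital inspections</h1>"
    "<script>var x = 'hidden';</script>"
    "<p>for the property surveyor</p></body></html>"
)

print(visible_text(sample))
print(phrase_survives(sample, "Digital inspections"))
```

Against the real page, response.text from the script above would be passed in place of sample. A result of (True, False) for a phrase would suggest the text arrives in the response but is lost during extraction.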
Can someone suggest a way to read the plain text from a webpage that reliably picks all of it up?
Thanks!