
**BeautifulSoup Grab Visible Webpage Text**

**Edit:** this is now working perfectly fine (see the final comment below).


I was using this solution from elsewhere on stackoverflow ( BeautifulSoup Grab Visible Webpage Text ) to get text out of webpages with beautiful soup:

import requests
from bs4 import BeautifulSoup

# suppress the InsecureRequestWarning triggered by verify=False

from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

# settings

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

url = "http://imfuna.com"

response = requests.get(url, headers=headers, verify=False)

soup = BeautifulSoup(response.text, "lxml")

for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)

front_text_count = len(text.split())  # count whitespace-separated words
print(front_text_count)
print(text)

For most sites this works nicely, but for the example URL above (imfuna.com) it only retrieves 6 words, even though the page contains many more (e.g. "Digital inspections for the residential or commercial property surveyor").

In this example, the missing words sit inside p/h1 tags, and I can't understand why the code doesn't pick them up.

Can someone suggest a way to simply read the plain text from a webpage so that all of it is properly picked up?
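For reference, the visible-text idea can also be written with only the standard library, which makes it easy to see exactly which tags are being skipped. This is a sketch, not the linked answer's exact code: it walks the HTML with `html.parser` and drops anything inside script/style/head/title.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text that a browser would render, skipping non-visible tags."""
    SKIP = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # >0 while inside a non-visible tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # HTML comments are ignored by default (no handle_comment override)
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    return "\n".join(parser.chunks)

html_doc = ("<html><head><title>t</title></head>"
            "<body><h1>Hello</h1><p>World</p>"
            "<script>var x=1;</script></body></html>")
print(visible_text(html_doc))  # prints "Hello" then "World" on separate lines
```

This will not solve a case where the text never arrives in the HTML at all (dynamic loading or a redirect, as discussed in the comments below), but it rules out parser-side problems.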

Thanks!

CodeGuru
the_t_test_1
  • This website seems to load its content dynamically. If you want to see what a browser would see, run `mechanize`. Otherwise, yes, those are the only words that exist on the page. – Artyer Jun 08 '17 at 16:59
  • 1
    I did everything like your script, except for using html.parser instead of lxml, and I got all the text on the page just fine. – randomdude999 Jun 08 '17 at 17:21
  • A) mechanize made no difference for me, still only 6 words but B) ...randomdude999 could you post your code? – the_t_test_1 Jun 08 '17 at 17:53
  • @randomdude999 could you post your code for html.parser ? If it works I'll mark it correct :) – the_t_test_1 Jun 13 '17 at 08:42
  • 1
    @the_t_test_1 just the same as yours, but replace "lxml" with "html.parser". Although you should try it aswell to confirm that it is an issue with lxml. – randomdude999 Jun 13 '17 at 09:59
  • @randomdude999 I tried putting html.parser in but it still only returns 6 words (for the example domain)... no errors or anything. Can you confirm how many words you got? More than 6, right? I can't think what could cause this to be different if we're running the exact same code (above, replacing "lxml" with "html.parser"). – the_t_test_1 Jun 13 '17 at 11:32
  • @randomdude999 still can't get this working... sorry to pester but if you can make it work it'd be great to know how & I'll mark the answer correct etc. :) – the_t_test_1 Jun 16 '17 at 12:30
  • Tried mechanize, html.parser, and a bunch of other stuff but still only get 6 words. Do either @randomdude999 or someone else have an idea what might be going wrong here / can they post or demo an answer that gets more than 6 words? – the_t_test_1 Jun 19 '17 at 16:48
  • Hey @randomdude999 I realised it's because the url isn't redirecting to http://imfuna.com/home-uk/ ... if I do that url then it retrieves all the text. I just need to get requests to allow itself to redirect... – the_t_test_1 Jun 22 '17 at 15:15

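The root cause flagged in the last comment is easy to verify: `requests` follows plain HTTP redirects by default (`allow_redirects=True` for GET) and records the hops in `response.history` and the final address in `response.url`; so if the final URL still shows the original address, the redirect is likely happening client-side (JavaScript or a meta refresh), which `requests` will not follow. A minimal self-contained demonstration using a throwaway local server and the standard library's `urllib` (which behaves the same way, exposing the final address via `resp.geturl()`); the `/home` path and the body text are stand-ins, not the real site:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            # Redirect the root to /home, like the site in the question.
            self.send_response(302)
            self.send_header("Location", "/home")
            self.end_headers()
        else:
            body = b"<p>Digital inspections</p>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

resp = urllib.request.urlopen(base + "/")
final_url = resp.geturl()        # where we actually ended up after redirects
html = resp.read().decode()
server.shutdown()
print(final_url.endswith("/home"), "Digital inspections" in html)  # True True
```

With `requests`, the equivalent check is `resp.history` (non-empty when a server-side redirect happened) and `resp.url` (the address that was ultimately fetched).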
0 Answers