I am trying to read all the text from a webpage, but I only get the text that is visible by default. The page has a "Read More" action button (labelled "Mehr lesen") that hides part of the text.

<action-button type="'link'" action="$ctrl.toggleDescription()" translate-key="$ctrl.showFullDescription ? 'COMPONENT.SEO_PAGE.LESS' : 'COMPONENT.SEO_PAGE.MORE'">
  <button type="submit" class="ActionButtonComponent-action link dark" ink-ripple="" ng-class="[$ctrl.type, $ctrl.theme, $ctrl.loading ? '__loading' : '']" ng-click="$ctrl.callback($event)" ng-disabled="$ctrl.inactive || $ctrl.loading">
    <span class="ActionButtonComponent-action-txt" translate="$ctrl.translateKey">Mehr lesen</span>
    <ng-transclude></ng-transclude>
    <div class="ActionButtonComponent-action-loading"></div>
    <div class="ink-ripple"></div>
  </button>
</action-button>

The code I am using to read is:

from collections import Counter

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "url_to_read"
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36"}

keys_folder = "Keys"
excel_file = "excel_file.xlsx"

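def getHTML(url):
    # requests only downloads the initial HTML; it does not run the page's JavaScript
    full_html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(full_html, features="lxml")
    # Drop link targets (this does not change the extracted text)
    for a in soup.find_all('a'):
        del a["href"]
    # Skip the first 2500 characters of the extracted text
    return soup.text[2500:]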

def getKeyWords(excel_file):
    df = pd.read_excel(keys_folder + "\\" + excel_file)
    return df["Query"]

def clean(paragraphs):
    pars = []
    for p in paragraphs:
        p = p.replace("<p>","")
        p = p.replace("</p>","")
        pars.append(p)
    return pars

def freq(html, key_words):
    # Count case-insensitive occurrences of each keyword followed by a space
    kv = []
    text = html.lower()
    for s in key_words:
        s += " "
        a = {s: text.count(s.lower())}
        kv.append(a)
    return kv

key_words = getKeyWords(excel_file)
html = getHTML(url)
freqs = freq(html, key_words)

result = Counter()

for elem in freqs:
    for key, value in elem.items():
        result[key] += value

df = pd.DataFrame(list(result.items()), columns=["Query", "Count"])
df.to_excel("Results\\Result " + excel_file[:-5] + ".xlsx")
print(df)
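
To make the counting step concrete, here is a small self-contained sketch of what freq and the Counter loop compute together, run on a made-up sample text instead of a real page. The name count_keywords and the sample string are hypothetical and only for illustration. Unlike the code above, the result is keyed by the keyword without the trailing space.

```python
from collections import Counter


def count_keywords(text, keywords):
    """Count case-insensitive occurrences of each keyword followed by a space."""
    text = text.lower()
    counts = Counter()
    for kw in keywords:
        # Same matching rule as freq(): the keyword must be followed by a space
        needle = kw.lower() + " "
        counts[kw] += text.count(needle)
    return counts


# Made-up sample text, not taken from any real page
sample = "Twin bed frames. A twin bed fits small rooms; bed linen sold separately."
print(count_keywords(sample, ["twin bed", "bed"]))
```

One consequence of appending the space: a keyword directly followed by punctuation or the end of the text (for example "bed.") is not counted.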

How can I also read the text that is hidden behind the "Mehr lesen" button?

  • Does this answer your question? [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – ChrisGPT was on strike Apr 20 '20 at 15:42
  • I have tried it, and I was not able to do it. Can you help me with it? The page that I am trying to get text from is https://de.twin.com – AntonioSCP Barroso Apr 21 '20 at 14:16
  • "Can you help me with it?"—No. Please read [Why is "Can someone help me?" not an actual question?](https://meta.stackoverflow.com/a/284237/354577) – ChrisGPT was on strike Apr 21 '20 at 14:18
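
The approach hinted at in the first comment (loading the page in a real browser so its JavaScript runs) could be sketched roughly as follows. This is only a sketch under assumptions: it assumes Selenium 4 and a matching Chrome driver are installed, that the button from the HTML above can be found with the CSS selector shown, and that clicking it is enough to reveal the full text. The function names make_chrome_driver and get_expanded_html are hypothetical, not part of any library.

```python
# Selector built from the button markup quoted in the question (an assumption
# that it matches the right button on the real page).
BUTTON_SELECTOR = "action-button button.ActionButtonComponent-action"


def make_chrome_driver():
    """Start a headless Chrome session (Selenium is imported lazily here)."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def get_expanded_html(driver, url, selector=BUTTON_SELECTOR, wait_seconds=10):
    """Load the page, click the "Mehr lesen" button, and return the rendered HTML."""
    # Let element lookups retry for a while, since the page builds its DOM with JavaScript
    driver.implicitly_wait(wait_seconds)
    driver.get(url)
    # "css selector" is the locator strategy string behind Selenium's By.CSS_SELECTOR
    driver.find_element("css selector", selector).click()
    return driver.page_source
```

Usage would look something like `driver = make_chrome_driver()`, then `html = get_expanded_html(driver, url)`, then `driver.quit()`. The returned HTML can be passed to BeautifulSoup in place of the `requests.get(...).text` result in getHTML.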

0 Answers