
I have fixed the previous problem, but now I face a new issue. When I search for Sheeran on the page with `if "Sheeran" in j:`, everything is fine. However, if I add one more keyword like "concert", changing it to `if "Sheeran" or "concert" in j:`, the results seem to be generated randomly. How can I fix it?

pagenum = 1
while True:
    url = 'https://xxxxxxxxx/{}'.format(pagenum)
    driver.get(url)
    pagesource = driver.page_source
    soup = BeautifulSoup(pagesource, 'lxml')
    if url == "https://xxxxxxxxxx/5":
        break
    else:
        for s in soup.find_all("div", class_="_2cNsJna0_hV8tdMj3X6_gJ"):
            for j in s:
                if "Sheeran" in j:  # searching for "Sheeran" alone is fine, but changing this to `"Sheeran" or "concert"` makes the results random
                    print(s.text)

    pagenum += 1

    time.sleep(2)

How can I search for something with multiple keywords?

James Ho
  • I think you need to scroll down first to generate more content. Wait for a certain number of scroll-downs, or until you reach the end of the page, then use Beautiful Soup to get the page source – Anwarvic Apr 14 '19 at 11:27
  • @Anwarvic I have tried that, but the URL is still the original one, so `soup.find` cannot find the next page, whose link is https//xxxxxxx/page/2 – James Ho Apr 14 '19 at 11:33
  • First, the most common way to scroll is to use `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`. This will get to the end of the page. Try this instead of `js.send_keys(Keys.PAGE_DOWN)` – Anwarvic Apr 14 '19 at 11:36
  • This might help you with BeautifulSoup and Selenium: https://stackoverflow.com/questions/50595558/selenium-scroll-and-beautifoul-soup-loop – geekandglitter Apr 14 '19 at 11:41
  • @Anwarvic it does not work. The reason may be that the page content is inside a div, so I have to find that element first – James Ho Apr 14 '19 at 11:58
  • @geekandglitter it doesn't work either. `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")` has no effect. – James Ho Apr 14 '19 at 12:05

1 Answer


An alternate approach would be to find out how the site fetches content when scrolling happens.

You can try increasing the page number in a loop.

import re

pagenum = 1
while True:
    url = 'https://lihkg.com/thread/1082050/page/{}'.format(pagenum)
    driver.get(url)
    pagesource = driver.page_source
    soup = BeautifulSoup(pagesource, 'lxml')
    profile_links = soup.find('a', attrs={'href': re.compile('/profile')})
    if not profile_links:
        break
    # page is valid, continue with code to extract results
    pagenum += 1

Or use the site's API URL that appears in the network traffic.
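The stop condition above can be exercised without Selenium. Here is a minimal sketch of the "loop until a marker element disappears" pattern, where `fetch_page` is a made-up stand-in for `driver.get` plus `page_source`, and the page markup is invented for illustration:

```python
# Fake page sources: real pages contain a '/profile' link, pages past
# the end of the thread do not.
FAKE_PAGES = {
    1: '<a href="/profile/123">user</a> post one',
    2: '<a href="/profile/456">user</a> post two',
}

def fetch_page(pagenum):
    # Stand-in for driver.get(url) + driver.page_source.
    return FAKE_PAGES.get(pagenum, '<div>no posts</div>')

def collect_pages():
    """Visit pages in order and stop when the marker element is missing."""
    pages = []
    pagenum = 1
    while True:
        source = fetch_page(pagenum)
        if '/profile' not in source:  # marker gone -> past the last page
            break
        pages.append(pagenum)
        pagenum += 1
    return pages

print(collect_pages())  # [1, 2]
```

In the real code the `'/profile' not in source` check would be the `soup.find(...)` lookup from the answer; the loop shape is the same.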

Sachin
  • I've thought about this method but I don't know how to loop over the page numbers. Could you give me an example? – James Ho Apr 14 '19 at 14:13
  • The trick is usually to find some element that does not appear on a page with an invalid number. It looks like there will be no profile links in the right panel if the page number is not valid. I have edited the answer to use that. Hope it helps. – Sachin Apr 14 '19 at 14:43
  • Thank you for your help. Your suggestions were helpful and I modified my code based on them. But now I face a new problem, which I mentioned above. Could you give me some advice? – James Ho Apr 14 '19 at 17:49
  • It should be `if "Sheeran" in j or "concert" in j`. – Sachin Apr 15 '19 at 06:46
  • Could you accept the answer and ask a new question instead of editing this one? – Sachin Apr 15 '19 at 06:48
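The fix Sachin gives in the comments works because Python parses `"Sheeran" or "concert" in j` as `("Sheeran") or ("concert" in j)`, and a non-empty string is always truthy, so the condition matches everything. A small sketch of the corrected multi-keyword check (the sample strings are made up):

```python
# `if "Sheeran" or "concert" in j:` always runs, because the bare string
# "Sheeran" is truthy on its own. Test each keyword against the text instead:
def matches_any(text, keywords):
    """Return True if any of the keywords appears in the text."""
    return any(kw in text for kw in keywords)

keywords = ["Sheeran", "concert"]
print(matches_any("Ed Sheeran tour dates", keywords))  # True
print(matches_any("ticket resale rules", keywords))    # False
```

With more than two keywords, `any()` keeps the condition readable where a chain of `or`s would not.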