
I am trying to build a web scraper in Python with the BeautifulSoup library. I want to get information from all pages of a Bitcoin forum topic. I am using the following code to get the username, status, date and time of post, post text, activity, and merit from the forum https://bitcointalk.org/index.php?topic=2056041.0

from bs4 import BeautifulSoup
import requests
import re

url = 'https://bitcointalk.org/index.php?topic=2056041.0'

def get_html(url):
    r = requests.get(url)
    return r.text


html = get_html(url)
soup = BeautifulSoup(html, 'lxml')

results= soup.findAll("td", {"valign" : "top"})
usernames=[]
for i in results:
    x=i.findAll('b')
    try:
        s=str(x[0])
        if 'View the profile of' in s :
            try:
              found = re.search('of (.+?)">', s).group(1)
              if found.isdigit()==False:
                usernames.append(found)
            except Exception as e :print(e)

    except Exception as e :pass#print(e)
print(len(usernames))
status=[]


for i in results:
    x=i.findAll("div", {"class": "smalltext"})
    s=str(x)
    try:
       found = re.search(' (.+?)<br/>', s).group(1)
       if len(found)<25:
          status.append(found)
    except:pass
print(len(status))


activity=[]
for i in results:
    x=i.findAll("div", {"class": "smalltext"})
    s=str(x)
    try:
        x=s.split('Activity: ')[1]
        x=x.split('<br/>')[0]
        activity.append(x)

    except Exception as e :pass   
print(activity)
print(len(activity))
posts=[]
for i in results:
    x=i.findAll("div", {"class": "post"})
    s=str(x)
    try:
        x=s.split('="post">')[1]
        x=x.split('</div>]')[0]
        if x.isdigit()!=True:
            posts.append(x)

    except Exception as e :pass


print(len(posts))

I feel that this is a very ugly and incorrect solution, with all the regex and try/except blocks. Is there a more straightforward and elegant solution for this task?

egorkh

1 Answer


You're right. It's ugly.

You say you're trying to scrape using BeautifulSoup, but you don't use the parsed soup object anywhere. If you were going to convert the soup object into a string and parse it using regex, you might as well have skipped the import of BeautifulSoup and used regex directly on r.text.
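To see why the import was wasted: the question's whole pipeline reduces to running a regex over a string. Here is a minimal sketch using a made-up HTML fragment shaped like the forum's profile links (not the real page source) — the same `re.search` pattern from the question extracts the username with no soup object involved at all:

```python
import re

# Made-up fragment in the same shape as the forum's profile links,
# purely for illustration -- this is not the real page source.
html = '<td valign="top"><b><a href="..." title="View the profile of hous26">hous26</a></b></td>'

# The question's extraction boils down to this: plain regex on a string.
found = re.search('of (.+?)">', html).group(1)
print(found)  # hous26
```

If string-level regex was the plan, `r.text` was already a string; parsing it into a tree and then stringifying the tree again only added work.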

Using regex to parse HTML is a bad idea. Here's why:

RegEx match open tags except XHTML self-contained tags
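A concrete failure mode, sketched with hypothetical nested markup (forum posts routinely nest quote divs inside post divs): a non-greedy regex stops at the first closing tag it sees, silently truncating the content.

```python
import re

# Hypothetical nested markup: a post div containing a quote div.
html = '<div class="post">outer <div class="quote">inner</div> tail</div>'

# The lazy match stops at the FIRST </div>, which closes the inner
# quote, not the post -- so " tail" is silently dropped.
m = re.search(r'<div class="post">(.*?)</div>', html)
print(m.group(1))  # 'outer <div class="quote">inner'
```

A real HTML parser tracks nesting depth, so it returns the full post body including everything after the inner tag.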

You seem to have merely discovered that BeautifulSoup can be used to parse HTML, but haven't gone through the documentation:

BeautifulSoup Documentation

Learn how to navigate the HTML tree. Their official documentation is more than enough for simple tasks like this:

usernames = []
statuses = []
activities = []
posts = []

# Each poster's info cell has class "poster_info"; the matching post body
# lives in the next <td>. The "Activity: N" line inside the smalltext div
# marks a real post row, so rows without it are skipped.
for i in soup.find_all('td', {'class': 'poster_info'}):
    j = i.find('div', {'class': 'smalltext'}).find(text=re.compile('Activity'))
    if j:
        usernames.append(i.b.a.text)  # linked profile name
        statuses.append(i.find('div', {'class': 'smalltext'}).contents[0].strip())  # rank, e.g. "Full Member"
        activities.append(j.split(':')[1].strip())  # the number after "Activity:"
        posts.append(i.find_next('td').find('div', {'class': 'post'}).text.strip())  # post body text

Here's the result of printing their lengths:

>>> len(usernames), len(statuses), len(activities), len(posts)
(20, 20, 20, 20)

And here are the actual contents:

for i, j, k, l in zip(usernames, statuses, activities, posts):
    print('{} - {} - {}:\n{}\n'.format(i, j, k, l))

Result:

hous26 - Full Member - 280:
Just curious.  Not counting anything less than a dollar in total worth.  I own 9 coin types:

satoshforever - Member - 84:
I own three but plan to add three more soon. But is this really a useful question without the size of the holdings?

.
.
.

papajamba - Full Member - 134:
7 coins as of the moment. Thinking of adding xrp again though too. had good profit when it was only 800-900 sats
  • Thanks a lot. I guess I need something like this: for i in soup.find_all('td', {'class': 'td_headerandpost'}): jj = i.find('div', {'class': 'smalltext'}); if jj: go.append(jj); print(jj.text) – egorkh Apr 20 '18 at 23:17