Get a specific value from HTML with Python BeautifulSoup

Question

I'm new to scraping. I'm working on a scraping project and trying to get a value from the HTML below:

<div class="buttons_zoom"><div class="full_prod"><a href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')" title="לחם אחיד פרוס אנג'ל 750 גרם - פרטים נוספים"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>

I want to get the value 379104, which is located in the onclick attribute. I'm using BeautifulSoup. The code:

for i in page_content.find_all('div', attrs={'class':'prodPrice'}):
    temp = i.parent.parent.contents[0]

temp returns a list of objects, and temp is equal to the HTML above. Can someone help me extract this ID? Thanks!

Edit: Wow, guys, thanks for the amazing explanation! But I have two issues. 1. The retry mechanism is not working. I set timeout=1 in order to make it fail, but once it fails it returns:

requests.exceptions.RetryError: HTTPSConnectionPool(host='www.XXXXX.il', port=443): Max retries exceeded with url: /default.asp?catid=%7B2234C62C-BD68-4641-ABF4-3C225D7E3D81%7D (Caused by ResponseError('too many redirects',))

Can you please help me with the retry mechanism in the code below? 2. Performance: without the retry mechanism, with timeout=6, scraping 8000 items takes 15 minutes. How can I improve this code's performance? Code below:

def get_items(self, dict):
        itemdict = {}
        for k, v in dict.items():
            boolean = True
        # here, we fetch the content from the url, using the requests library
            while (boolean):
             try:
                a =requests.Session()
                retries = Retry(total=3, backoff_factor=0.1, status_forcelist=[301,500, 502, 503, 504])
                a.mount(('https://'), HTTPAdapter(max_retries=retries))
                page_response = a.get('https://www.XXXXXXX.il' + v, timeout=1)
             except requests.exceptions.Timeout:
                print  ("Timeout occurred")
                logging.basicConfig(level=logging.DEBUG)
             else:
                 boolean = False

            # we use the html parser to parse the url content and store it in a variable.
            page_content = BeautifulSoup(page_response.content, "html.parser")
            for i in page_content.find_all('div', attrs={'class':'prodPrice'}):
                parent = i.parent.parent.contents[0]
                getparentfunc= parent.find("a", attrs={"href": "javascript:void(0)"})
                itemid = re.search(".*'(\d+)'.*", getparentfunc.attrs['onclick']).groups()[0]
                itemName = re.sub(r'\W+', ' ', i.parent.contents[0].text)
                priceitem = re.sub(r'[\D.]+ ', ' ', i.text)
                itemdict[itemid] = [itemName, priceitem]

score 2 · Answer 1 · answered Feb 21 '19 at 22:33

Both solutions below assume a regular, consistent structure for the onclick attribute.

If there can only be one match, then something like the following:

from bs4 import BeautifulSoup as bs

html ='''    
<div class="buttons_zoom"><div class="full_prod"><a href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')" title="לחם אחיד פרוס אנג'ל 750 גרם - פרטים נוספים"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>

'''    
soup = bs(html, 'lxml')
element = soup.select_one('[onclick^="js:getProdID"]')
print(element['onclick'].split(',')[2].strip(')'))

If there can be more than one match:

from bs4 import BeautifulSoup as bs

html ='''
<div class="buttons_zoom"><div class="full_prod"><a href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')" title="לחם אחיד פרוס אנג'ל 750 גרם - פרטים נוספים"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>
'''
soup = bs(html, 'lxml')
elements = soup.select('[onclick^="js:getProdID"]')
for element in elements:
    print(element['onclick'].split(',')[2].strip(')'))
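
One detail worth checking: with the sample onclick value, split(',')[2].strip(')') appears to leave the surrounding single quotes attached to the number. The sketch below works on the plain string only (no parsing library needed); the `onclick` variable is hard-coded from the sample HTML above, and passing both characters to strip is a suggested tweak, not part of the original answer.

```python
# The onclick value copied from the sample HTML above.
onclick = ("js:getProdID('https://www.XXXXXXX.co.il',"
           "'{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')")

parts = onclick.split(',')          # the three comma-separated arguments
with_quotes = parts[2].strip(')')   # only ')' is stripped, the quotes remain
product_id = parts[2].strip("')")   # strip both ')' and "'" from the ends

print(with_quotes)
print(product_id)
```

This only holds while none of the arguments contains a comma, which is the "consistent structure" assumption stated at the top of this answer.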


QHarr


Did you try this? It is more efficient, faster, and more reliable than using regex, which you should [avoid](https://stackoverflow.com/a/1732454/6241235) when dealing with HTML. – QHarr Feb 22 '19 at 12:24
You are correct, one should not use regex to parse HTML. That's the reason why I used BeautifulSoup to parse the HTML and make my way to the piece of data I'm interested in, which is `"js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"`. The data does not contain any HTML (it is clean, it's just a plain string), so from this point I'm allowed to use whatever method gets the job done. You chose to use `split` and `strip` (which is a valid answer, and the first solution I went for as well); that's why I've up-voted your answer. – Remy J Feb 22 '19 at 23:14
But I chose to use a regular expression instead. You could argue that I'm killing a fly with a bazooka, and that's an argument I would concede. But it gets the job done, and cleanly at that. – Remy J Feb 22 '19 at 23:14
OP never mentioned he/she was after performance. I'm not convinced that your method is more efficient. The way I see it, `split` has to search the string in order to find the character to split on. It must then create a `list` containing as many strings as there are split characters, plus one. `strip` would work in a similar way (from my understanding at least). That's a lot of moving pieces IMO, and I fail to see why you claim that your solution is more efficient. Could you please explain? – Remy J Feb 22 '19 at 23:14
@RemyJ I hadn't actually noticed you were only working with an isolated string at that point so I apologise. – QHarr Feb 22 '19 at 23:17
Don't worry about it, thank you for your apology. I have to admit I doubted myself a bit and thought the question through more thoroughly... And that's thanks to you, it was a good exercise, thank you. – Remy J Feb 22 '19 at 23:30

Remy J · Accepted Answer · 2019-02-23T14:18:45.033

from bs4 import BeautifulSoup as bs
import re

txt = """<div class="buttons_zoom"><div class="full_prod"><a href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')" title="לחם אחיד פרוס אנג'ל 750 גרם - פרטים נוספים"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>"""

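soup = bs(txt, 'html.parser')
# find the first <a> tag whose href is "javascript:void(0)"
a = soup.find("a", attrs={"href": "javascript:void(0)"})
# the onclick attribute holds the string that contains the id
data = a.attrs["onclick"]
# capture the run of digits enclosed in single quotes
r = re.search(r".*'(\d+)'.*", data).groups()[0]
print(r) # will print 379104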

Edit

Replaced ".*\}.*,.*'(\d+)'\).*" with ".*'(\d+)'.*". They produce the same result but the latter is much cleaner.

Explanation : Soup

Find the first a tag whose "href" attribute has "javascript:void(0)" as its value. See the Beautiful Soup documentation on keyword arguments for more.

a = soup.find("a", attrs={"href":"javascript:void(0)"})

This is equivalent to

a = soup.find("a", href="javascript:void(0)")

In older versions of Beautiful Soup, which don’t have the class_ shortcut, you can use the attrs trick mentioned above. Create a dictionary whose value for “class” is the string (or regular expression, or whatever) you want to search for. -- see beautiful soup documentation about "attrs"

a points to an element of type <class 'bs4.element.Tag'>. We can access the tag attributes like we would do for a dictionary via the property a.attrs (more about that at beautiful soup attributes). That's what we do in the following statement.

a_tag_attributes = a.attrs # that's the dictionary of attributes in question...

The dictionary keys are named after the tag's attributes. Here we have the following keys/attribute names: 'title', 'href' and 'onclick'.
We can check that for ourselves by printing them.

print(a_tag_attributes.keys()) # equivalent to print(a.attrs.keys())

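This will output something like the following (on recent Python versions the keys follow the order of the attributes in the HTML; older versions may list them in a different order)

dict_keys(['href', 'onclick', 'title']) # those are the attribute names (the keys to our dictionary)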

From here, we need to get the data we are interested in. The key to our data is "onclick" (it's named after the HTML attribute where the data we seek lies).

data = a_tag_attributes["onclick"] # equivalent to data = a.attrs["onclick"]

data now holds the following string.

"js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"

Explanation : Regex

Now that we have isolated the piece that contains the data we want, we're going to extract just the portion we need.
We'll do so by using a regular expression (this site is an excellent resource if you want to know more about Regex, good stuff).

To use regular expressions in Python, we must import the re module. More about the "re" module here, good good stuff.

import re

Regex lets us search a string for text that matches a pattern.

Here the string is our data, and the pattern is ".*'(\d+)'.*" (which is also a string as you can tell by the use of the double quotes).

You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is ^.*\.txt$.
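
The wildcard analogy can be checked directly in Python. This is a small sketch, assuming a made-up list of file names; it compares the standard library's fnmatch wildcard matching with the regex quoted above.

```python
import fnmatch
import re

# Hypothetical file names, for illustration only.
names = ["notes.txt", "report.txt.bak", "readme.md"]

# Wildcard notation, as used in a file manager.
wildcard_hits = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]

# The regex equivalent quoted above.
regex_hits = [n for n in names if re.search(r"^.*\.txt$", n)]

print(wildcard_hits)
print(regex_hits)
```

Both lists should contain the same names, since the two notations describe the same set of strings here.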

It's best to read about regular expressions to understand them further. Here's a quick start, good good good stuff.

Here we describe the string we are searching for: zero or more characters of any kind, followed by one or more digits enclosed in single quotes, followed by zero or more further characters. Because the leading .* is greedy, the pattern captures the last quoted run of digits in the string.

The parentheses are used to extract a group (that's called capturing in regex); we capture just the part that's a number.

By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a quantifier to the entire group or to restrict alternations to part of the regex.
Only parentheses can be used for grouping. Square brackets define a character class, and curly braces are used by a quantifier with specific limits. -- Use Parentheses for Grouping and Capturing

r = re.search(r".*'(\d+)'.*", data) # raw string, so \d reaches the regex engine unchanged

Defining the symbols :

.* matches any character (except for line terminators); the * means zero or more repetitions
' matches the character '
\d+ matches at least one digit (\d is equal to [0-9]); that's the part we capture
(\d+) Capturing group; this means capture the part of the string where a digit is repeated at least once
() are used for capturing; the part that matches the pattern within the parentheses is saved.
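
To see how these symbols behave together, here is a short sketch using only the re module. The `data` string is copied from the onclick value above; the second string, `two`, is a made-up example with two quoted numbers, and the shorter pattern is shown for comparison only, not as the answer's method.

```python
import re

data = ("js:getProdID('https://www.XXXXXXX.co.il',"
        "'{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')")

# re.search scans the whole string, so the leading and trailing .* are
# optional when there is only one quoted run of digits.
full = re.search(r".*'(\d+)'.*", data)
short = re.search(r"'(\d+)'", data)
print(full.group(1), short.group(1))

# Hypothetical input with two quoted numbers: the greedy leading .* makes
# the long pattern capture the last one, the short pattern the first one.
two = "f('111','222')"
last = re.search(r".*'(\d+)'.*", two).group(1)
first = re.search(r"'(\d+)'", two).group(1)
print(last, first)
```

The difference only matters when several quoted numbers appear in the same attribute; for the sample data both patterns capture the same digits.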

The captured part can later be accessed by calling r.groups() on the result of re.search (r refers to that result).
This returns a tuple containing what was captured. Note that re.search itself returns None when nothing matches, in which case there is nothing to call groups() on.

In our case the first (and only) item of the tuple is the string of digits...

captured_groups = r.groups() # that's the tuple containing our data: ('379104',)

We can now access our data, which is at the first index of the tuple (we only captured one group)

print(captured_groups[0]) # this will print out 379104
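
Since re.search returns None when nothing matches, calling .groups() on the result without a check would raise an AttributeError for an onclick value that has no quoted number. A minimal sketch with a guard follows; `extract_id` is a hypothetical helper name and the sample strings are assumptions modelled on the onclick value above.

```python
import re

def extract_id(onclick):
    """Return the quoted run of digits, or None if the pattern does not match."""
    match = re.search(r".*'(\d+)'.*", onclick)
    if match is None:
        return None
    # groups() is a tuple with one item per capturing group.
    return match.groups()[0]

sample = ("js:getProdID('https://www.XXXXXXX.co.il',"
          "'{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')")

print(extract_id(sample))
print(extract_id("js:somethingElse()"))
```

Returning None lets the calling loop skip an item instead of stopping on the first tag that does not follow the expected structure.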

Much appreciated. Can you please explain these 2 lines: a = soup.find("a", attrs={"href":"javascript:void(0)"}) r = re.search(".*\}.*,.*'(\d+)'\).*", a.attrs['onclick']).groups()[0] – djiso1, Feb 21 '19 at 23:30
Sure. I'll add some comments to my code to give some explanation. – Remy J, Feb 21 '19 at 23:39
Wow, guys, thanks for the amazing explanation! But I have two issues. 1. The retry mechanism is not working. I set timeout=1 in order to make it fail, but once it fails it returns: requests.exceptions.RetryError: HTTPSConnectionPool(host='www.XXXXX.il', port=443): Max retries exceeded with url: /default.asp?catid=%7B2234C62C-BD68-4641-ABF4-3C225D7E3D81%7D (Caused by ResponseError('too many redirects',)) Can you please help me with the retry mechanism in the code below? 2. Performance: without the retry mechanism, with timeout=6, scraping 8000 items takes 15 minutes. – djiso1, Feb 24 '19 at 17:41
Thanks Remy, can you also help with the performance issue and the retry mechanism? – djiso1, Feb 24 '19 at 17:47
