How to extract one element from a site

Question

I want to extract contact information from this page: http://www.smtnet.com/company/index.cfm?fuseaction=view_company&company_id=49509 I have already written the code below:

from scrapy.selector import Selector
from selenium import webdriver
driver = webdriver.Chrome(r'D:\chromedriver_win32\chromedriver.exe')  # raw string keeps the backslashes literal
driver.get("http://www.smtnet.com/company/index.cfm?fuseaction=view_company&company_id=49509")
sel = Selector(text=driver.page_source)
Company_Name = sel.xpath('*//p[1]/strong/text()').extract_first()
Country=sel.xpath('*//p[2]/text()').extract()[-2]
webSite = ????

But I failed to extract the company website; it should be https://www.europlacer.com/.

Can anyone tell me how to extract it?

Shivam Mishra · Accepted Answer · 2018-07-15T07:35:13.947

If you just want the href attribute of "Visit Website" button, then use this:

Company_URL = sel.xpath("//div[@id = 'tabs-1']/p[3]/a/@href").extract_first()

However, the code above returns only this:

act_open_company_page.cfm?url_id=70098

That is because the company URL (i.e. 'https://www.europlacer.com/') is not stored directly in the href attribute; it is resolved later by JavaScript. But if you look closely at the source:

<a onclick="return trackOutboundLink('company_url','http://www.europlacer.com','49509');" href="act_open_company_page.cfm?url_id=70098" target="_blank" class=""><img src="/images/buttons/visit-website.jpg" alt="Visit EUROPLACER website" class=""></a>

You can see that the direct URL is passed as an argument to the function in the onclick attribute, so you need to extract it from there. First, get the onclick attribute's value:

URL = sel.xpath("//div[@id = 'tabs-1']/p[3]/a/@onclick").extract_first()

Then, extract your required URL from it like this:

URL = URL.split(",")[1]
URL = URL.strip("'")  # remove the leading and trailing quotes
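Splitting on commas would break if the URL itself contained a comma. As a minimal sketch of a slightly more defensive variant, the arguments of the call can be parsed as a whole. The helper name `url_from_onclick` is hypothetical, and the approach assumes the arguments are plain quoted strings that are also valid Python literals, as they are in the onclick value shown above:

```python
import ast


def url_from_onclick(onclick: str) -> str:
    """Return the second argument of a call like trackOutboundLink('a','b','c')."""
    start = onclick.index("(")   # opening parenthesis of the call
    end = onclick.rindex(")")    # closing parenthesis of the call
    # "('company_url','http://...','49509')" parses as a Python tuple of strings.
    args = ast.literal_eval(onclick[start:end + 1])
    return args[1]


ONCLICK = "return trackOutboundLink('company_url','http://www.europlacer.com','49509');"
website = url_from_onclick(ONCLICK)
print(website)
```

In the scraper, the string passed in would be the value returned by the `@onclick` XPath query above.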

Another method would be to actually resolve the value of the href attribute. When you click the link, it becomes something like:

http://www.smtnet.com/company/act_open_company_page.cfm?url_id=70098

So the trick would be to resolve the href against the page URL (that is, prepend "http://www.smtnet.com/company/"), load the resulting URL, and then read the final URL once it changes. The first method I described is a lot easier, though.
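The first half of that second method, turning the relative href into an absolute URL, can be sketched with the standard library. The constant names `PAGE_URL` and `HREF` are only illustrative; their values are the ones quoted in this thread:

```python
from urllib.parse import urljoin

# The page that was scraped and the relative href found on the button.
PAGE_URL = "http://www.smtnet.com/company/index.cfm?fuseaction=view_company&company_id=49509"
HREF = "act_open_company_page.cfm?url_id=70098"

# urljoin resolves the href relative to the page, the same way a browser does.
absolute_url = urljoin(PAGE_URL, HREF)
print(absolute_url)
```

Assuming the site then redirects to the company's own page, loading `absolute_url` with `driver.get(...)` and reading `driver.current_url` afterwards should give the final address, at the cost of one extra page load per company.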

Additionally, for the company name, I think you should try this:

Company_Name = sel.xpath('//header/h1/text()').extract_first()

The line above returns only the company name (i.e. "EUROPLACER"), whereas your code picks up some extra text as well.

You are welcome. If you found the answer useful, please accept my answer. — Shivam Mishra, Jul 15 '18 at 11:29

score 0 · Answer 2 · answered Jul 15 '18 at 06:58

When you inspect the Visit Website button in the developer console, you see this:

<a onclick="return trackOutboundLink('company_url','http://www.europlacer.com','49509');" href="act_open_company_page.cfm?url_id=70098" target="_blank">
    <img src="/images/buttons/visit-website.jpg" alt="Visit EUROPLACER website">
  </a>

You want to grab the anchor element and retrieve the URL from its onclick attribute, like so:

company_link_handler = sel.xpath('//*[@id="tabs-1"]/p[3]/a').attrib.get('onclick')
website = company_link_handler.split(',')[1].strip("'")  # second argument, quotes removed

thirdDeveloper · Answer 3 · 2018-07-15T11:28:00.803

First

Find a unique element: the link has no CSS class or id you can select on, so you have to find a unique element that leads you to the target. The img inside the button can help you.

So, you can get it like this:

sel.xpath('//img[@src="/images/buttons/visit-website.jpg"]')

Second

Get the target element: how does this unique element help? The element carrying the company URL is its parent node (reachable with /..), and we need that parent's onclick attribute:

sel.xpath('//img[@src="/images/buttons/visit-website.jpg"]/../@onclick')
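To see the parent step in isolation, here is a small sketch that does not need Scrapy or a browser. It swaps in the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath but does understand `..`, and it runs against a hand-written, well-formed copy of the button markup (the `SNIPPET` string and the self-closed img tag are assumptions made for the example, not the real page source):

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the relevant part of the page.
SNIPPET = """
<div id="tabs-1">
  <p>
    <a onclick="return trackOutboundLink('company_url','http://www.europlacer.com','49509');"
       href="act_open_company_page.cfm?url_id=70098" target="_blank">
      <img src="/images/buttons/visit-website.jpg" alt="Visit EUROPLACER website" />
    </a>
  </p>
</div>
""".strip()

root = ET.fromstring(SNIPPET)

# Locate the unique img, then step up to its parent with "..".
anchor = root.find(".//img[@src='/images/buttons/visit-website.jpg']/..")
onclick = anchor.get("onclick")
print(anchor.tag, onclick)
```

With Scrapy's selector the same idea is the single query shown above, since its XPath engine can return the attribute directly via `/../@onclick`.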

Final step

Extract the required text: there are many methods and tools for this; I tested a regex and it works properly:

import re

s = sel.xpath('//img[@src="/images/buttons/visit-website.jpg"]/../@onclick').extract_first()
x = re.search(r"(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s']{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s']{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]\.[^\s']{2,}|www\.[a-zA-Z0-9]\.[^\s']{2,})", s)
result = x.group(0)  # the first URL found in the onclick value

Note that I used the regex pattern mentioned here, with a few small changes. Also, don't forget to import the re package.
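If a general-purpose URL pattern feels heavy for this case, a much shorter one may be enough. This is only a sketch, and it assumes the URL always appears as a single-quoted argument starting with http:// or https://, as in the onclick value quoted in this thread (the `ONCLICK` constant stands in for the value returned by `extract_first()`):

```python
import re

ONCLICK = "return trackOutboundLink('company_url','http://www.europlacer.com','49509');"

# Capture everything between single quotes that starts with http:// or https://.
match = re.search(r"'(https?://[^']+)'", ONCLICK)
website = match.group(1) if match else None
print(website)
```

Checking `match` before calling `group` avoids an AttributeError on pages where the button is missing or the argument has a different shape.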

Thanks, but I tried the regex and it doesn't work for me; it returns "expected string or buffer". — Yan Zhang, Jul 15 '18 at 10:30
It was my fault; `extract_first()` was missing. I updated the code. — thirdDeveloper, Jul 15 '18 at 11:28
