Unable to scrape the name from the inner page of each result using requests

Question

I've created a script in python making use of post http requests to get the search results from a webpage. To populate the results, it is necessary to click on the fields sequentially shown here. Now a new page will be there and this is how to populate the result.

There are ten results in the first page and the following script can parse the results flawlessly.

What I wish to do now is use the results to reach their inner page in order to parse Sole Proprietorship Name (English) from there.

website address

I've tried so far with:

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.businessregistration.moc.gov.kh/cambodia-master/service/create.html?targetAppCode=cambodia-master&targetRegisterAppCode=cambodia-br-soleproprietorships&service=registerItemSearch"

payload = {
    'QueryString': '0',
    'SourceAppCode': 'cambodia-br-soleproprietorships',
    'OriginalVersionIdentifier': '',
    '_CBASYNCUPDATE_': 'true',
    '_CBHTMLFRAG_': 'true',
    '_CBNAME_': 'buttonPush'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
    res = s.get(url)
    target_url = res.url.split("&")[0].replace("view.", "update.")
    node = re.findall(r"nodeW\d.+?-Advanced",res.text)[0].strip()
    payload['_VIKEY_'] = re.findall(r"viewInstanceKey:'(.*?)',", res.text)[0].strip()
    payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
    payload[node] = 'N'
    payload['_CBNODE_'] = re.findall(r"Callback\('(.*?)','buttonPush", res.text)[2]
    payload['_CBHTMLFRAGNODEID_'] = re.findall(r"AsyncWrapper(W\d.+?)'",res.text)[0].strip()

    res = s.post(target_url,data=payload)
    soup = BeautifulSoup(res.content, 'html.parser')
    for item in soup.find_all("span", class_="appReceiveFocus")[3:]:
        print(item.text)

How can I parse the Name (English) from each of the results inner page using requests?

Does this answer your question? [Unable to let my script populate result using post requests](https://stackoverflow.com/questions/60906977/unable-to-let-my-script-populate-result-using-post-requests) — αԋɱҽԃ αмєяιcαη, Apr 17 '20 at 21:16
The question that you linked is different from the one I've raised here @αԋɱҽԃ αмєяιcαη. This one is about scraping the `name` from different depth. Thanks. — asmitu, Apr 18 '20 at 04:09
I believe that I've asked you before about the final target and you confirmed that you can handle the rest but what am seeing currently that you are going from an issue to another which means you need someone to keep writing code for you. — αԋɱҽԃ αмєяιcαη, Apr 18 '20 at 14:30
@asmitu Do you have to access the inner page to scrape the English name from there? Can't you just scrape the English name from the `appReceiveFocus` elements? All of the search results seem to have the English name baked into the link. — Paul M., Apr 18 '20 at 20:38
Yep, l noticed that when I created this Post . The thing is I'll parse other fields from that page as well, so it is necessary to access the inner page. — asmitu, Apr 18 '20 at 20:48

score 2 · Answer 1 · answered Apr 22 '20 at 09:54

This is one of the ways you can parse the name from the site's inner page and then email address from the address tab. I added this function .get_email() only because I wanted to let you know as to how you can parse content from different tabs.

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.businessregistration.moc.gov.kh/cambodia-master/service/create.html?targetAppCode=cambodia-master&targetRegisterAppCode=cambodia-br-soleproprietorships&service=registerItemSearch"
result_url = "https://www.businessregistration.moc.gov.kh/cambodia-master/viewInstance/update.html?id={}"
base_url = "https://www.businessregistration.moc.gov.kh/cambodia-br-soleproprietorships/viewInstance/update.html?id={}"

def get_names(s):
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
    res = s.get(url)
    target_url = result_url.format(res.url.split("id=")[1])
    soup = BeautifulSoup(res.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}

    payload['QueryString'] = 'a'
    payload['SourceAppCode'] = 'cambodia-br-soleproprietorships'
    payload['_CBNAME_'] = 'buttonPush'
    payload['_CBHTMLFRAG_'] = 'true'
    payload['_VIKEY_'] = re.findall(r"viewInstanceKey:'(.*?)',", res.text)[0].strip()
    payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
    payload['_CBNODE_'] = re.findall(r"Callback\('(.*?)','buttonPush", res.text)[-1]
    payload['_CBHTMLFRAGNODEID_'] = re.findall(r"AsyncWrapper(W\d.+?)'",res.text)[0].strip()

    res = s.post(target_url,data=payload)
    soup = BeautifulSoup(res.text,"lxml")
    payload.pop('_CBHTMLFRAGNODEID_')
    payload.pop('_CBHTMLFRAG_')
    payload.pop('_CBHTMLFRAGID_')

    for item in soup.select("a[class*='ItemBox-resultLeft-viewMenu']"):
        payload['_CBNAME_'] = 'invokeMenuCb'
        payload['_CBVALUE_'] = ''
        payload['_CBNODE_'] = item['id'].replace('node','')

        res = s.post(target_url,data=payload)
        soup = BeautifulSoup(res.text,'lxml')
        address_url = base_url.format(res.url.split("id=")[1])
        node_id = re.findall(r"taba(.*)_",soup.select_one("a[aria-label='Addresses']")['id'])[0]
        payload['_CBNODE_'] = node_id
        payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
        payload['_CBNAME_'] = 'tabSelect'
        payload['_CBVALUE_'] = '1'
        eng_name = soup.select_one(".appCompanyName + .appAttrValue").get_text()
        yield from get_email(s,eng_name,address_url,payload)

def get_email(s,eng_name,url,payload):
    res = s.post(url,data=payload)
    soup = BeautifulSoup(res.text,'lxml')
    email = soup.select_one(".EntityEmailAddresses:contains('Email') .appAttrValue").get_text()
    yield eng_name,email

if __name__ == '__main__':
    with requests.Session() as s:
        for item in get_names(s):
            print(item)

Output are like:

('AMY GEMS', 'amy.n.company@gmail.com')
('AHARATHAN LIN LIANJIN FOOD FLAVOR', 'skykoko344@gmail.com')
('AMETHYST DIAMOND KTV', 'twobrotherktv@gmail.com')

Before executing the above script, make sure your `BeautifulSoup` version is [4.7.0](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) or later in order for it to support [pesudo selector](https://developer.mozilla.org/en-US/docs/Web/CSS/Pseudo-classes) which I've used within the script to parse the name and email from there. — robots.txt, Apr 22 '20 at 11:40

score 1 · Answer 2 · answered Apr 20 '20 at 15:45

1

To get the Name (English) you can simply replace print(item.text) with print(item.text.split('/')[1].split('(')[0].strip()) which prints AMY GEMS

answered Apr 20 '20 at 15:45

Harish Vutukuri

1,092
6
14

1

Before proposing a solution at least try to read the post @Harish Vutukuri. If I wanted to parse that portion from it's landing page, I could do that in the very first place. However, that is not the post is about Thanks. – asmitu Apr 20 '20 at 16:21

Unable to scrape the name from the inner page of each result using requests

2 Answers2

Linked