How to capture data from a website as key-value pairs using Python?

Question

[Screenshot 1: output generated] [Screenshot 2: code used to get the Model name]

import requests
from bs4 import BeautifulSoup

# Assumed header; the original post does not show how `headers` was defined.
headers = {'User-Agent': 'Mozilla/5.0'}

test_link = 'https://www.amd.com/en/products/cpu/amd-ryzen-9-3900xt'
r = requests.get(test_link, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
whole_data = soup.find('div', class_='fieldset-wrapper')
specifications = []
specifications_value = []
for variable1 in whole_data.find_all('div', class_='field__label'):
    #print(variable1.text)
    variable1 = variable1.text
    specifications = list(variable1.split('\n'))
    #print(specifications)
for variable2 in whole_data.find_all('div', class_='field__item'):
    #print(variable2.text)
    variable2 = variable2.text
    specifications_value = list(variable2.split('\n'))
    #print(specifications_value)

Issue: I am getting the data, but in separate variables and separate for loops. How can I map these two variables as key-value pairs, so that I can check conditions such as: if the key is Platform, then take only its value (Boxed Processor)?

I want to capture the data in such a way that if the key is 'Platform', only its value (Boxed Processor) is captured, and similarly for all the other 14 tags.

Please don't include text as screenshots. Stack Overflow has many [formatting features](/help/formatting) you can use to include text such as code, output, etc. in your question. Consider replacing your screenshot of text with a code block containing the actual text. — Pranav Hosangadi, Dec 23 '21 at 18:54
Does this answer your question? [How to iterate through two lists in parallel?](https://stackoverflow.com/questions/1663807/how-to-iterate-through-two-lists-in-parallel) You want to iterate over `whole_data.find_all('div', class_='field__label')` and `whole_data.find_all('div', class_='field__item')` simultaneously. — Pranav Hosangadi, Dec 23 '21 at 18:55
@PranavHosangadi Thank you! Yes, I want to iterate through these two. But I also want to check that if the entry in the first list is Platform, then only pick the value from the second list, and otherwise leave it blank. For example, if Product Family is not there, then I have to leave it blank. Both are in different lists, so how will I map them? — jayinee desai, Dec 23 '21 at 19:07
When you iterate over those two in parallel instead of in separate loops as shown in the link I shared, you will get one value of `specifications` and the corresponding value of `specifications_value`. — Pranav Hosangadi, Dec 23 '21 at 20:45
@QHarr Thanks for replying. Yes, I have a URL with a missing value: https://www.amd.com/en/products/cpu/amd-ryzen-7-3800xt has all the required fields, but the other URL, https://www.amd.com/en/products/cpu/amd-ryzen-9-3900xt, does not have CPU Socket. — jayinee desai, Dec 23 '21 at 22:02
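The parallel iteration suggested in the comments above can be sketched as follows. This is only an illustration: the two lists are stand-ins for the `.text` of each `field__label` and `field__item` div, and their values are made up rather than scraped. It also assumes exactly one value per label, in the same order, which may not hold for multi-valued fields such as OS support.

```python
# Stand-in data (illustrative values, not scraped from the AMD page).
labels = ['Platform', 'Product Family', '# of CPU Cores']
values = ['Boxed Processor', 'AMD Ryzen Processors', '12']

# zip() pairs the n-th label with the n-th value; strip() removes
# stray whitespace/newlines that scraped text often carries.
specification = {label.strip(): value.strip()
                 for label, value in zip(labels, values)}

# Keys we care about; 'CPU Socket' is deliberately absent from `labels`.
wanted = ['Platform', 'CPU Socket']

# dict.get() returns '' when a key is missing, so absent specs stay blank.
row = {key: specification.get(key, '') for key in wanted}
print(row)  # {'Platform': 'Boxed Processor', 'CPU Socket': ''}
```

Because `zip` pairs purely by position, a label with several values would shift every later pair, which is one reason to look values up per key instead.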

score 0 · Accepted Answer · answered Dec 24 '21 at 00:22

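You can iterate over a list of expected keys and use the `:-soup-contains` pseudo-class to find the label node containing each key, then select the value node adjacent to it. If there is a match, store its text (or, for '*OS Support', the text of every item in the same container). Otherwise, store ''.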

import requests
from bs4 import BeautifulSoup as bs

links = ['https://www.amd.com/en/products/cpu/amd-ryzen-7-3800xt',
         'https://www.amd.com/en/products/cpu/amd-ryzen-9-3900xt']

all_keys = ['Platform', 'Product Family', 'Product Line', '# of CPU Cores',
            '# of Threads', 'Max. Boost Clock', 'Base Clock', 'Total L2 Cache', 'Total L3 Cache',
            'Default TDP', 'Processor Technology for CPU Cores', 'Unlocked for Overclocking', 'CPU Socket',
            'Thermal Solution (PIB)', 'Max. Operating Temperature (Tjmax)', 'Launch Date', '*OS Support']

with requests.Session() as s:
    # Send a browser-like User-Agent with every request in this session.
    s.headers = {'User-Agent': 'Mozilla/5.0'}

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        specification = {}

        for key in all_keys:
            # Find the value next to the label containing `key`: either a single
            # .field__item sibling, or the first .field__item inside a .field__items sibling.
            spec = soup.select_one(
                f'.field__label:-soup-contains("{key}") + .field__item, .field__label:-soup-contains("{key}") + .field__items .field__item')

            if spec is None:
                # The spec is missing on this page; keep the key with a blank value.
                specification[key] = ''
            elif key == '*OS Support':
                # Multi-valued field: collect every item in the same container.
                specification[key] = [
                    i.text for i in spec.parent.select('.field__item')]
            else:
                specification[key] = spec.text

        print(specification)
        print()
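To see what the selector does without hitting the network, here is a small offline sketch. The HTML is hypothetical markup that imitates the label/value layout (it is not copied from the AMD page), and `lookup` is a helper name introduced only for this illustration. As a variation on the code above, it detects multi-valued fields by checking for the `field__items` wrapper rather than by key name. It assumes a Soup Sieve version that supports `:-soup-contains` (2.1 or later, as far as I know).

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the label/value layout.
html = """
<div class="fieldset-wrapper">
  <div class="field">
    <div class="field__label">Platform</div>
    <div class="field__item">Boxed Processor</div>
  </div>
  <div class="field">
    <div class="field__label">*OS Support</div>
    <div class="field__items">
      <div class="field__item">Windows 10 - 64-Bit Edition</div>
      <div class="field__item">Ubuntu x86 64-Bit</div>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')


def lookup(soup, key):
    """Return the value(s) beside the label containing `key`, or ''."""
    selector = (
        f'.field__label:-soup-contains("{key}") + .field__item, '
        f'.field__label:-soup-contains("{key}") + .field__items .field__item'
    )
    spec = soup.select_one(selector)
    if spec is None:
        return ''  # label not present on this page
    if 'field__items' in spec.parent.get('class', []):
        # Value is wrapped in a multi-item container: return all items.
        return [i.get_text(strip=True)
                for i in spec.parent.select('.field__item')]
    return spec.get_text(strip=True)


print(lookup(soup, 'Platform'))
print(lookup(soup, 'CPU Socket'))
print(lookup(soup, '*OS Support'))
```

Note that `:-soup-contains` is a substring match, so a short key could in principle match a longer label that contains it.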

answered Dec 24 '21 at 00:22 by QHarr

Thank you for this code. As I am new to this, I am not able to understand it yet. I am breaking it into small parts and trying to understand each one. – jayinee desai Dec 24 '21 at 14:27
I look for the elements with class `.field__label` that contain the desired search text, e.g. 'Platform'. If one is present, I move to the adjacent element and grab its value, or its children's values (for the OS support info). If a particular spec description, e.g. CPU Socket, is not present on a given webpage, then the value of `spec` will be None, and so I return ''. – QHarr Dec 24 '21 at 16:04
Thanks for all the help! I understood it. I hope I will also be able to write code like you one day soon. – jayinee desai Dec 24 '21 at 20:00
I have one more query: how can I include Model as a key as well? When I tried it with the code, it gave the output below: {'Model': [AMD Ryzen™ 7 3800XT | Desktop Processor | AMD]} – jayinee desai Dec 25 '21 at 06:37
It looks like you need to add `.text` to the end of `spec`, i.e. `spec.text`, when assigning to the dictionary. – QHarr Dec 25 '21 at 06:38
I have added code and output images in the question. When I try `.text`, it gives an error: Traceback (most recent call last): File "C:\Users\admin\PycharmProjects\project1_AMD\Test.py", line 28, in specification[key] = spec.text File "C:\Users\admin\AppData\Local\Programs\Python\Python39\lib\site-packages\bs4\element.py", line 2253, in __getattr__ raise AttributeError( AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()? – jayinee desai Dec 25 '21 at 06:43
Is there a URL with Model included? – QHarr Dec 25 '21 at 07:08
I got it. I had to add logic in the if condition that checks for None. Thank you @QHarr. I am still trying to get its text, but I will get that eventually too. – jayinee desai Dec 25 '21 at 07:17
We have to use `.select('Title')` to get the model name. So I gave the key as Model, waited for its value to come back as None, and then added this inside that if condition: `if spec is None: specification[key] = ''` `if key == 'Model': spec = soup.select('Title')` `specification[key] = [i.text for i in spec]` `print(spec)` – jayinee desai Dec 25 '21 at 07:20
`soup.select_one('.page-title')` – QHarr Dec 25 '21 at 07:39
Thank you so much for all the help! I have one more question. With `model_list = sub_url + model_list`, `print(model_list)` gives me the URLs without any commas, and `model_links.append(model_list)` gives me model_links = ['https://www.amd.com/en/products/cpu/amd-ryzen-9-3900xt'] ['https://www.amd.com/en/products/cpu/amd-ryzen-7-3800xt']. But I want model_links = ['https://www.amd.com/en/products/cpu/amd-ryzen-9-3900xt','https://www.amd.com/en/products/cpu/amd-ryzen-7-3800xt'] as a single list. – jayinee desai Dec 27 '21 at 06:02
Please open a new question with all that info and your code. It is difficult to follow in comments, and others are less likely to see it. – QHarr Dec 27 '21 at 07:00
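The `ResultSet object has no attribute 'text'` error discussed in the comments above comes from the difference between `select` (a list-like result) and `select_one` (a single element or None). A minimal offline sketch, assuming hypothetical markup in which the model name sits in an element with class `page-title`, as the `soup.select_one('.page-title')` comment suggests:

```python
from bs4 import BeautifulSoup

# Hypothetical page; the real AMD markup may differ.
html = (
    '<html><head><title>AMD Ryzen 7 3800XT | Desktop Processor | AMD</title></head>'
    '<body><h1 class="page-title">AMD Ryzen 7 3800XT</h1></body></html>'
)
soup = BeautifulSoup(html, 'html.parser')

# select() returns a list-like ResultSet, so take .text from each element.
titles = [t.get_text(strip=True) for t in soup.select('title')]

# select_one() returns a single Tag (or None), so .text works directly.
node = soup.select_one('.page-title')
model = node.get_text(strip=True) if node is not None else ''

specification = {'Model': model}
print(titles)
print(specification)
```

Keeping the `None` check means a page without that element still yields a blank Model value, matching how the other keys are handled.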
