I'm having a hard time understanding why my dictionary stores only the last values after a for loop that scrapes multiple pages of a website with the same structure.

import pandas as pd
import requests
from bs4 import BeautifulSoup

pages = ['https://example.com/page1.html',
         'https://example.com/page2.html']

final_dict = {}

for url in pages:
    r = requests.get(url)
    
    soup = BeautifulSoup(r.content, 'html.parser')
    first_table = soup.select_one("table:nth-of-type(1)")
    labels = first_table.findAll('td', class_='label')

    label_filter = []
    for label in labels:
        label_filter.extend(label.findAll('span', class_='txt'))

    label_filter_txt = []
    for i in label_filter:
        label_filter_txt.append(i.text)      


    label_data = []
    datapoints = first_table.findAll('td', class_='data')
    for i in datapoints:
        label_data.extend(i.findAll('span', class_='txt'))

    label_data_txt = []
    for i in label_data:
        label_data_txt.append(i.text)

    first_table_dict = dict(zip(label_filter_txt, label_data_txt))
    
    second_table = soup.select_one("table:nth-of-type(2)")
    
    labels = second_table.findAll('td', class_='label')

    label_filter = []
    for label in labels:
        label_filter.extend(label.findAll('span', class_='txt'))

    label_filter_txt = []
    for i in label_filter:
        label_filter_txt.append(i.text)

    datapoints = second_table.findAll('td', class_='data')

    label_data = []
    for i in datapoints:
        label_data.extend(i.findAll('span', class_='txt'))

    label_data_txt = []
    for i in label_data:
        label_data_txt.append(i.text)

    second_table_dict = dict(zip(label_filter_txt, label_data_txt))

    final_dict.update(first_table_dict)
    final_dict.update(second_table_dict)

df = pd.DataFrame([final_dict])

With this code, my DataFrame contains only the data from the last URL. The dict seems to be overwritten on each iteration of the loop, but I don't know why.

  • Do first_table_dict and second_table_dict both have the same keys? If yes, then the values in second_table_dict will overwrite the values in final_dict. – Kurt Jun 27 '22 at 16:12
  • Yes, they do. How can I append just the new values? – interferemadly Jun 27 '22 at 16:14
  • Keys in a dictionary are unique. If you want multiple values at the same key, you will need to make the value a list and append to that list. – Kurt Jun 27 '22 at 16:14
  • So I would need to create a dict in the first loop (or outside the loop?), append the values to a list in the subsequent loops, and then add those values to the dict at the end? – interferemadly Jun 27 '22 at 16:17
  • Try this approach: https://stackoverflow.com/a/960735/202168 – Anentropic Jun 27 '22 at 16:22
  • @Anentropic Based on your answer, I was able to solve my problem. Thanks! – interferemadly Jun 27 '22 at 20:46
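
The behaviour discussed in these comments can be reproduced without any scraping. The following is a minimal sketch, assuming two hypothetical per-page dicts (page1 and page2, with made-up labels and values) that share the same keys. It contrasts dict.update(), which keeps only the last value for each key, with dict.setdefault(), which lets values accumulate in a list.

```python
# Hypothetical per-page results; the labels and values are made up for illustration.
page1 = {"Name": "Alpha", "Price": "10"}
page2 = {"Name": "Beta", "Price": "20"}

# dict.update() replaces the value of any key that already exists,
# so only the last page survives.
overwritten = {}
for page in (page1, page2):
    overwritten.update(page)

# dict.setdefault() returns the list already stored under a key
# (or stores and returns a new empty list), so values accumulate.
collected = {}
for page in (page1, page2):
    for key, value in page.items():
        collected.setdefault(key, []).append(value)

print(overwritten)  # {'Name': 'Beta', 'Price': '20'}
print(collected)    # {'Name': ['Alpha', 'Beta'], 'Price': ['10', '20']}
```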

1 Answer

I managed to solve my problem with dict.setdefault. This is my final code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

pages = ['https://example.com/page1.html',
         'https://example.com/page2.html']

label_filter = []
label_data = []

# The label and data spans of all pages accumulate in the two lists above.
for url in pages:
    r = requests.get(url)
    
    soup = BeautifulSoup(r.content, 'html.parser')
    first_table = soup.select_one("table:nth-of-type(1)")
    
    labels = first_table.findAll('td', class_='label')
    for label in labels:
        label_filter.extend(label.findAll('span', class_='txt'))
 

    datapoints = first_table.findAll('td', class_='data')
    for i in datapoints:
        label_data.extend(i.findAll('span', class_='txt'))
        
    second_table = soup.select_one("table:nth-of-type(2)")
    
    labels = second_table.findAll('td', class_='label')
    for label in labels:
        label_filter.extend(label.findAll('span', class_='txt'))


    datapoints = second_table.findAll('td', class_='data')
    for i in datapoints:
        label_data.extend(i.findAll('span', class_='txt'))

   
# Extract the text of every collected span element.
label_filter_txt = [span.text for span in label_filter]
label_data_txt = [span.text for span in label_data]

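# Group the values by label: each label maps to a list of its values.
my_dict = {}
for label, value in zip(label_filter_txt, label_data_txt):
    my_dict.setdefault(label, []).append(value)

# orient='index' plus transpose() also works when the lists differ in length
# (shorter columns are padded with missing values).
df = pd.DataFrame.from_dict(my_dict, orient='index').transpose()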
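
Another way to avoid the overwriting is to keep one dict per page in a list instead of merging everything into a single dict. This is only a sketch, assuming each page yields one flat set of label/value pairs. The function name parse_page and the sample data are hypothetical stand-ins for the scraping code above, not part of my actual solution.

```python
def parse_page(labels, values):
    """Hypothetical stand-in for the per-page scraping: pair labels with values."""
    return dict(zip(labels, values))


# Made-up (labels, values) pairs standing in for two scraped pages.
scraped = [
    (["Name", "Price"], ["Alpha", "10"]),
    (["Name", "Price"], ["Beta", "20"]),
]

# One dict per page, appended to a list, so nothing is overwritten.
rows = []
for labels, values in scraped:
    rows.append(parse_page(labels, values))

print(rows)
```

Passing such a list to pd.DataFrame(rows) gives one row per page, with the labels as columns.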