I am trying to capture data from two nodes that have the same name. I am using the find method, but it always returns the value of the nested node rather than the direct child node. I've tried a few different ways to target it, but I am not having any success. Any help is appreciated, as always.

API Response:


<affiliate_export_response>
  <success>true</success>
  <row_count>1</row_count>
  <affiliates>
      <blacklists>
        <blacklist>
          <advertiser>
            <advertiser_id xmlns="API:id_name_store">2</advertiser_id>
            <advertiser_name xmlns="API:id_name_store">wayne's Ad</advertiser_name>
          </advertiser>
          <affiliate>
            <affiliate_id xmlns="API:id_name_store">3</affiliate_id>
            <affiliate_name xmlns="API:id_name_store">Mark Affiliate</affiliate_name>
          </affiliate>
          <blacklist_reason>
            <blacklist_reason_id xmlns="API:id_name_store">1</blacklist_reason_id>
            <blacklist_reason_name xmlns="API:id_name_store">404</blacklist_reason_name>
          </blacklist_reason>
          <blacklist_type>
            <blacklist_type_id xmlns="API:id_name_store">3</blacklist_type_id>
            <blacklist_type_name xmlns="API:id_name_store">404</blacklist_type_name>
          </blacklist_type>
          <date_created>2018-04-26T00:00:00</date_created>
        </blacklist>
      </blacklists>
      <date_created>2018-01-29T11:40:58.34</date_created>
      <notes />
    </affiliate>
  </affiliates>
</affiliate_export_response>

Code:

import requests
from bs4 import BeautifulSoup

url = 'API URL'
params = {
          'param1':'dfasdf',
          'param2':3
          }
r = requests.get(url, params=params)
soup = BeautifulSoup(r.text, 'lxml')
for affiliate in soup.select('affiliate'):
     date_created = affiliate.find('date_created').string
     print(date_created)

The goal is to capture 2018-01-29T11:40:58.34 but I am capturing the nested date_created node inside blacklists and getting 2018-04-26T00:00:00 instead.

pppery
ApacheOne
  • [This StackOverflow question and answer](https://stackoverflow.com/questions/6287529/how-to-find-children-of-nodes-using-beautifulsoup) may help. – c_sagan Oct 23 '19 at 23:53

2 Answers

First of all, you are missing the opening tag for affiliate, which will make your parser go wrong:

<affiliate_export_response>
    <success>true</success>
    <row_count>1</row_count>
    <affiliates>
        <affiliate> <!-- this opening tag was missing -->
        <blacklists>
          <blacklist>
            <advertiser>
              <advertiser_id xmlns="API:id_name_store">2</advertiser_id>
              <advertiser_name xmlns="API:id_name_store">wayne's Ad</advertiser_name>
            </advertiser>
            <affiliate>
              <affiliate_id xmlns="API:id_name_store">3</affiliate_id>
              <affiliate_name xmlns="API:id_name_store">Mark Affiliate</affiliate_name>
            </affiliate>
            <blacklist_reason>
              <blacklist_reason_id xmlns="API:id_name_store">1</blacklist_reason_id>
              <blacklist_reason_name xmlns="API:id_name_store">404</blacklist_reason_name>
            </blacklist_reason>
            <blacklist_type>
              <blacklist_type_id xmlns="API:id_name_store">3</blacklist_type_id>
              <blacklist_type_name xmlns="API:id_name_store">404</blacklist_type_name>
            </blacklist_type>
            <date_created>2018-04-26T00:00:00</date_created>
          </blacklist>
        </blacklists>
        <date_created>2018-01-29T11:40:58.34</date_created>
        <notes />
      </affiliate>
    </affiliates>
  </affiliate_export_response>

Use this to get the creation date of an affiliate only

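from bs4 import BeautifulSoup

# `text` holds the corrected XML shown above
soup = BeautifulSoup(text, 'html.parser')
# The child combinator (>) matches only date_created nodes directly under an affiliate
for date_created in soup.select('affiliate > date_created'):
    print(date_created.text)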


lagripe
  • Thank you for the detailed response and for pointing out that the opening tag was missing. I missed this during my edit before posting. I have a follow-up question, though. I am already looping over several data points within the affiliate node, so I need to locate date_created inside that existing for loop rather than selecting affiliate > date_created on its own. Any recommendation on how to find this value once inside my loop, under for affiliate in soup.select('affiliate'):, instead? – ApacheOne Oct 24 '19 at 00:10

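In this particular example, with bs4 4.7.1+, you can use :nth-child(even) in your loop over affiliate nodes. It works here because the affiliate's own date_created is its second child (an even position), whereas the nested date_created is the fifth child of blacklist (an odd position).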

from bs4 import BeautifulSoup as bs

html = '''
<affiliate_export_response>
    <success>true</success>
    <row_count>1</row_count>
    <affiliates>
        <affiliate>
        <blacklists>
          <blacklist>
            <advertiser>
              <advertiser_id xmlns="API:id_name_store">2</advertiser_id>
              <advertiser_name xmlns="API:id_name_store">wayne's Ad</advertiser_name>
            </advertiser>
            <affiliate>
              <affiliate_id xmlns="API:id_name_store">3</affiliate_id>
              <affiliate_name xmlns="API:id_name_store">Mark Affiliate</affiliate_name>
            </affiliate>
            <blacklist_reason>
              <blacklist_reason_id xmlns="API:id_name_store">1</blacklist_reason_id>
              <blacklist_reason_name xmlns="API:id_name_store">404</blacklist_reason_name>
            </blacklist_reason>
            <blacklist_type>
              <blacklist_type_id xmlns="API:id_name_store">3</blacklist_type_id>
              <blacklist_type_name xmlns="API:id_name_store">404</blacklist_type_name>
            </blacklist_type>
            <date_created>2018-04-26T00:00:00</date_created>
          </blacklist>
        </blacklists>
        <date_created>2018-01-29T11:40:58.34</date_created>
        <notes />
      </affiliate>
    </affiliates>
  </affiliate_export_response>
'''
soup = bs(html, 'lxml')
for item in soup.select('affiliate'):
    date = item.select_one('date_created:nth-child(even)')
    if date is None:
        print('N/A')
    else:
        print(date.text)
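
Since the nth-child approach depends on the exact position of each child, a position-independent alternative may be worth considering. The sketch below is not part of the original answer: it assumes a trimmed-down stand-in for the API response, uses the standard-library 'html.parser' so that lxml is not required, and introduces a hypothetical helper named own_date_created. It relies on find(..., recursive=False), which restricts the search to direct children of the tag.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the API response (illustrative only).
xml = """
<affiliates>
  <affiliate>
    <blacklists>
      <blacklist>
        <affiliate>
          <affiliate_id>3</affiliate_id>
        </affiliate>
        <date_created>2018-04-26T00:00:00</date_created>
      </blacklist>
    </blacklists>
    <date_created>2018-01-29T11:40:58.34</date_created>
    <notes />
  </affiliate>
</affiliates>
"""

soup = BeautifulSoup(xml, "html.parser")


def own_date_created(affiliate):
    """Hypothetical helper: text of a date_created that is a direct child, else None."""
    # recursive=False limits the search to direct children, so the
    # date_created nested under blacklists > blacklist is never considered.
    node = affiliate.find("date_created", recursive=False)
    return node.get_text(strip=True) if node is not None else None


# select() returns both the outer affiliate and the one nested in blacklist.
results = [own_date_created(a) for a in soup.select("affiliate")]
for value in results:
    print(value or "N/A")
```

Inside an existing loop over soup.select('affiliate'), the same call can sit next to the other lookups. With bs4 4.7.1+, affiliate.select_one(':scope > date_created') should behave equivalently, if a CSS selector is preferred.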
QHarr