
I have written some code to scrape parts of the UK Companies House website. Some of the fields do not always exist, so the code contains an if/else statement that checks whether an XPath matches; if it does not, it assigns "n/a" to the variable. If I did not do this, my lists would get out of balance and I would start returning the wrong date of birth for each person (in other words, I have to force the dateofbirths variable to take a string to keep everything in order).

The problem I have is that the code

dateofbirths = "n/a"

only ever returns the first letter (i.e. in this case I get the string "n" when it runs, instead of the full "n/a").

Does anyone know why this would be?

The full code is below

import scrapy
import re

from CompaniesHouse.items import CompanieshouseItem

class CompaniesHouseSpider(scrapy.Spider):
    name = "companieshouse"
    allowed_domains = ["companieshouse.gov.uk"]
    start_urls = ["https://beta.companieshouse.gov.uk/company/OC361003/officers",
]

    def parse(self, response):
        for count in range(0,100):
            for sel in response.xpath('//*[@id="content-container"]'):
                string1 = "officer-name-" + str(count)
                names = sel.xpath('//*[@id="%s"]/a/text()' %string1).extract()
                names = [name.strip() for name in names]
                namerefs = sel.xpath('//*[@id="%s"]/a/@href' %string1).re(r'(?<=/officers/).*?(?=/appointments)')
                namerefs = [nameref.strip() for nameref in namerefs]
                string2 = "officer-role-" + str(count)
                roles = sel.xpath('//*[@id="%s"]/text()' %string2).extract()
                roles = [role.strip() for role in roles]
                string3 = "officer-date-of-birth-" + str(count)
                if sel.xpath('//*[@id="%s"]/text()' %string3):
                    dateofbirths = sel.xpath('//*[@id="%s"]/text()' %string3).extract()
                else:
                    dateofbirths = "n/a"
                dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]
                result = zip(names, namerefs, roles, dateofbirths)
                for name, nameref, role, dateofbirth in result:
                   item = CompanieshouseItem()
                   item['name'] = name
                   item['nameref'] = nameref
                   item['role'] = role
                   item['dateofbirth'] = dateofbirth               
                   yield item

        next_page = response.xpath('//*[@class="pager"]/li/a[@class="page"][contains(., "Next")]/@href').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = "https://beta.companieshouse.gov.uk" + next_href
            request = scrapy.Request(url=next_page_url)
            yield request
jmunsch
nevster

1 Answer


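Because dateofbirths is a string. Iterating over a string yields its characters one at a time, so the list comprehension produces one entry per character, and zip, which stops at its shortest input, keeps only the first of them when the other lists hold a single entry: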

>>> dateofbirths = "n/a"
>>> [dateofbirth.strip() for dateofbirth in dateofbirths]
['n', '/', 'a']

Try wrapping the default in a list:

>>> dateofbirths = ["n/a"]
>>> [dateofbirth.strip() for dateofbirth in dateofbirths]
['n/a']
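
To show how this plays out inside the spider's zip step, here is a minimal sketch that runs without Scrapy. The helper name `clean_or_default` and the sample values are hypothetical and only stand in for what `.extract()` would return for one officer; it relies on `.extract()` returning an empty list when the XPath matches nothing.

```python
def clean_or_default(values, default="n/a"):
    """Strip each extracted string; use a one-element list if nothing matched."""
    cleaned = [value.strip() for value in values]
    return cleaned if cleaned else [default]


# Hypothetical stand-ins for the lists extracted for a single officer.
names = ["SMITH, John"]
roles = ["LLP Member"]
missing = []  # what .extract() gives back when the date-of-birth node is absent

# Fixed behaviour: the default is a list, so zip sees one whole "n/a" entry.
rows = list(zip(names, roles, clean_or_default(missing)))

# Original behaviour: the default is a bare string, so it is split into
# characters and zip keeps only the first one.
buggy = list(zip(names, roles, [char.strip() for char in "n/a"]))

print(rows)
print(buggy)
```

In the spider, this would mean passing the result of `.extract()` through such a helper instead of assigning the bare string in the else branch.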
jmunsch
  • @nevster no problem. :) also for future reference: http://stackoverflow.com/help/mcve it's a good way to learn how to debug, also: http://stackoverflow.com/questions/4228637/getting-started-with-the-python-debugger-pdb and `import pdb;pdb.set_trace()` – jmunsch Dec 18 '16 at 03:44