I have written some code to scrape parts of the UK Companies House website. Some of the fields do not always exist, so the code contains an if/else statement that checks whether an XPath matches anything; if it does not, it assigns "n/a" to the variable. If I did not do this, my lists would get out of step and I would start returning the wrong date of birth for each person (in other words, I have to force the dateofbirths variable to take a string to keep everything in order).
The problem I have is that the code
dateofbirths = "n/a"
only ever returns the first letter (i.e. in this case I get the string "n" when it is actioned instead of the full "n/a").
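Here is a minimal sketch, outside Scrapy, that seems to reproduce the same symptom. The `names` and `roles` values are made-up stand-ins for what the spider extracts for one officer; only the `dateofbirths` handling mirrors my code. It suggests that the list comprehension walks over the string one character at a time, and that `zip` then stops at the shortest input.

```python
# Made-up stand-ins for the values extracted for a single officer.
names = ["SMITH, John"]
roles = ["LLP Member"]

# The fallback from my spider: a plain string, not a list.
dateofbirths = "n/a"

# Same clean-up step as in the spider; iterating a string yields its characters.
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]
print(dateofbirths)  # ['n', '/', 'a']

# zip stops after the shortest input, so only the first character survives.
rows = list(zip(names, roles, dateofbirths))
print(rows)  # [('SMITH, John', 'LLP Member', 'n')]
```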
Does anyone know why this would be?
The full code is below
import scrapy
import re

from CompaniesHouse.items import CompanieshouseItem


class CompaniesHouseSpider(scrapy.Spider):
    name = "companieshouse"
    allowed_domains = ["companieshouse.gov.uk"]
    start_urls = [
        "https://beta.companieshouse.gov.uk/company/OC361003/officers",
    ]

    def parse(self, response):
        for count in range(0, 100):
            for sel in response.xpath('//*[@id="content-container"]'):
                string1 = "officer-name-" + str(count)
                names = sel.xpath('//*[@id="%s"]/a/text()' % string1).extract()
                names = [name.strip() for name in names]

                namerefs = sel.xpath('//*[@id="%s"]/a/@href' % string1).re(r'(?<=/officers/).*?(?=/appointments)')
                namerefs = [nameref.strip() for nameref in namerefs]

                string2 = "officer-role-" + str(count)
                roles = sel.xpath('//*[@id="%s"]/text()' % string2).extract()
                roles = [role.strip() for role in roles]

                string3 = "officer-date-of-birth-" + str(count)
                if sel.xpath('//*[@id="%s"]/text()' % string3):
                    dateofbirths = sel.xpath('//*[@id="%s"]/text()' % string3).extract()
                else:
                    dateofbirths = "n/a"
                dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]

                result = zip(names, namerefs, roles, dateofbirths)
                for name, nameref, role, dateofbirth in result:
                    item = CompanieshouseItem()
                    item['name'] = name
                    item['nameref'] = nameref
                    item['role'] = role
                    item['dateofbirth'] = dateofbirth
                    yield item

        next_page = response.xpath('//*[@class="pager"]/li/a[@class="page"][contains(., "Next")]/@href').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = "https://beta.companieshouse.gov.uk" + next_href
            request = scrapy.Request(url=next_page_url)
            yield request
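
For comparison, here is a rough sketch of one possible way to keep the fallback the same shape as the other lists. It assumes each XPath yields at most one value per officer. `extract_or_default` is a hypothetical helper name of my own, not part of Scrapy; it only relies on `.extract()` returning a list of strings.

```python
def extract_or_default(values, default="n/a"):
    """Strip each extracted string; fall back to a one-element list if none were found."""
    cleaned = [value.strip() for value in values]
    return cleaned if cleaned else [default]


# Made-up example inputs: one officer with a date of birth, one without.
with_dob = extract_or_default(["  March 1960 "])
without_dob = extract_or_default([])

print(with_dob)     # ['March 1960']
print(without_dob)  # ['n/a']
```

In the spider this would presumably be called as `dateofbirths = extract_or_default(sel.xpath('//*[@id="%s"]/text()' % string3).extract())`, so that `zip` always receives a list with one whole string rather than a bare string.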