I am currently using Scrapy to scrape a website that lists profiles. The spider follows every link in the list (each link is one profile), extracts the data, goes back, follows the next link, and so on. This is how I structured it:
import scrapy

class Profiles(scrapy.Spider):
    name = 'profiles'
    allowed_domains = ['url.com']
    start_urls = ['https://www.url/profiles/']

    def parse(self, response):
        for profile in response.css('.herald-entry-content p'):
            url = response.urljoin(profile.css('a::attr(href)').extract_first())
            yield scrapy.Request(url=url, callback=self.parse_profile, dont_filter=True)

    def parse_profile(self, response):
        # This XPath is tied to the page-specific post id -- see my problem below
        birth_name = response.xpath(
            "//*[@id='post-19807']/div/div[1]/div/div[2]/div/p[1]/text()[1]"
        ).extract()
        # Profile is my Item class (definition omitted)
        profile = Profile(
            birth_name=birth_name
        )
        yield profile
While working on this, I ran into a problem fetching certain data. Here is a snippet of what the structure looks like on an actual profile page:
<div class="herald-entry-content">
<p><b>Profile: Facts<br>
</b><br>
<span>Stage Name:</span> Any name<br>
<span>Birth Name:</span> Any name<br>
<span>Birthday:</span> July 10, 1994<br>
<span>Zodiac Sign:</span> Cancer<br>
<span>Height:</span> 178 cm <br>
</p>
</div>
I would like to extract the Birth Name here, but using birth_name = response.css(".herald-entry-content p span::text") gives me the text of the span elements themselves, which is not what I want. I tried playing around with XPath (right-click and Copy XPath in Chrome), which gave me //*[@id="post-19807"]/div/div[1]/div/div[2]/div/p[1]/text()[2]. This works, but the post id is specific to this page, and since I loop over the other profiles as well, that value changes on every page. Is there any way I can tell the spider to find the element without hard-coding the id? I'm kind of lost on how to proceed with this.
Thanks a lot!