I am currently using Scrapy to scrape a website. The website is a list of profiles, so the spider follows every link in the list (each link is one profile), extracts the data, then comes back and follows the next one, and so on. This is how I structured it:

class Profiles(scrapy.Spider):
    name = 'profiles'
    allowed_domains = ['url.com']
    start_urls = ['https://www.url/profiles/']

    def parse(self, response):
        for profile in response.css('.herald-entry-content p'):
            url = response.urljoin(profile.css('a::attr(href)').extract_first())
            yield scrapy.Request(url=url, callback=self.parse_profile, dont_filter=True)

    def parse_profile(self, response):
        birth_name = response.xpath("//*[@id='post-19807']/div/div[1]/div/div[2]/div/p[1]/text()[1]").extract()
        profile = Profile(
            birth_name=birth_name
        )
        yield profile

While working, I have encountered a problem with fetching certain data. Here is a snippet of what the structure looks like on the actual profile page:

    <div class="herald-entry-content">
        <p><b>Profile: Facts<br>
        </b><br>
            <span>Stage Name:</span> Any name<br>
            <span>Birth Name:</span> Any name<br>
            <span>Birthday:</span> July 10, 1994<br>
            <span>Zodiac Sign:</span> Cancer<br>
            <span>Height:</span> 178 cm <br>
        </p>
    </div>

I would like to extract the Birth Name here, but using birth_name = response.css(".herald-entry-content p span::text") gives me the text of the span element (the label), which is not what I want. I tried playing around with XPath (right-click and Copy XPath in Chrome), which gave me //*[@id="post-19807"]/div/div[1]/div/div[2]/div/p[1]/text()[2]. This works, but the post ID is specific to this page, and I loop over the other profiles as well, so that value changes from page to page. Is there any way I can tell the spider to look for the element and get the ID itself? I'm kind of lost on how to proceed with this.

Thanks a lot!

turbzcoding
  • can you share the html with `post-id`. – supputuri Feb 15 '20 at 19:05
  • @supputuri : here u go, i removed all the scripts etc.. https://jsfiddle.net/hz9pycde/ – turbzcoding Feb 15 '20 at 22:10
  • Not sure why you want to rely on the `post-xxxxx`. You can get the `birth_name` using a simple `//article//p[1]/text()[4]`. This should work on all the posts as you loop through them. – supputuri Feb 16 '20 at 03:35
  • @supputuri: That sadly looks wrong. I do get it for some of the pages, but for 95% of them it returns the wrong text or a newline ("\n"), e.g. ```{"birth_name": ["\n"]}, {"birth_name": ["\n"]}, {"birth_name": [" January 30, 1989", "\n"]}, {"birth_name": [" Bae Soo Bin (\ubc30\uc218\ube48)", "\n"]},``` – turbzcoding Feb 16 '20 at 09:39

1 Answer

This might be a case where you have to fall back to a regular expression.

Without knowing the full structure of the page it is hard to give you exactly what you need, but here is an example using the snippet you gave:

import scrapy

sel = scrapy.Selector(text="""
 <div class="herald-entry-content">
        <p><b>Profile: Facts<br>
        </b><br>
            <span>Stage Name:</span> Any name<br>
            <span>Birth Name:</span> Any name<br>
            <span>Birthday:</span> July 10, 1994<br>
            <span>Zodiac Sign:</span> Cancer<br>
            <span>Height:</span> 178 cm <br>
        </p>
    </div>
""")

info = sel.re(r"<span>(.+):</span>\s(.+)<br>")  # raw string so \s is not treated as a string escape
output = dict(zip(*[iter(info)] * 2))
print(output)

will give you

{'Stage Name': 'Any name', 
 'Birth Name': 'Any name', 
 'Birthday': 'July 10, 1994', 
 'Zodiac Sign': 'Cancer', 
 'Height': '178 cm '}

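The slightly cryptic dict(zip(*[iter(info)] * 2)) is a common idiom for grouping a flat list into pairs: sel.re returns the captured groups as one flat list (label, value, label, value, ...), and zipping a single iterator with itself takes two consecutive items at a time.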
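
To make the pairing step easier to follow, here is a small standalone sketch. The `info` list is hand-written sample data standing in for what the regex returns, and the intermediate name `pairs` is only there for illustration.

```python
# Hand-written stand-in for the flat list returned by sel.re(...):
# label, value, label, value, ...
info = ["Stage Name", "Any name", "Birth Name", "Any name", "Birthday", "July 10, 1994"]

# [iter(info)] * 2 is a list containing the SAME iterator twice,
# so zip() pulls two consecutive items from it for every tuple.
pairs = list(zip(*[iter(info)] * 2))

# dict() turns each (label, value) tuple into a key/value entry.
output = dict(pairs)

print(pairs)
print(output)
```

If the list has an odd number of items, zip silently drops the last one, so this relies on every label having a value.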

Note that you shouldn't have to use scrapy.Selector directly; you should be able to do something like this:

def parse_profile(self, response):
    herald_content = response.xpath('//div[@class="herald-entry-content"]')
    info = herald_content.re(r"<span>(.+):</span>\s(.+)<br>")  # raw string keeps \s intact
    # and so on from example above...
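
As a rough sketch of how the remaining steps might look, the same pattern can be wrapped in a small helper that works on a plain HTML string. The names `FACT_PATTERN` and `extract_facts` are made up for this example, and it assumes the real pages follow the `<span>Label:</span> value<br>` layout from the snippet in the question, which may not hold for every profile.

```python
import re

# Same pattern as above, compiled once. Assumes each fact sits on one line
# as "<span>Label:</span> value<br>" (taken from the question's snippet).
FACT_PATTERN = re.compile(r"<span>(.+):</span>\s(.+)<br>")


def extract_facts(html):
    """Return a {label: value} dict for every fact line found in html."""
    return {
        label.strip(): value.strip()
        for label, value in FACT_PATTERN.findall(html)
    }


sample_html = """
<div class="herald-entry-content">
    <p><b>Profile: Facts<br>
    </b><br>
        <span>Stage Name:</span> Any name<br>
        <span>Birth Name:</span> Any name<br>
        <span>Birthday:</span> July 10, 1994<br>
        <span>Zodiac Sign:</span> Cancer<br>
        <span>Height:</span> 178 cm <br>
    </p>
</div>
"""

facts = extract_facts(sample_html)
print(facts.get("Birth Name"))
```

Inside parse_profile you could then pass in the extracted HTML of the content div, for example extract_facts(response.css(".herald-entry-content").get(default="")), and read facts.get("Birth Name"). Looking values up by label this way avoids depending on the page-specific post ID or on the position of a text node.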
tomjn
  • @tomjin I have included the HTML, if that helps. But I will look at your solution tomorrow. Thanks a lot! https://jsfiddle.net/hz9pycde/ – turbzcoding Feb 15 '20 at 22:09
  • Nice use of re. I would have just gotten the text of div.herald-entry-content and then cleaned it and zipped it. The regular expression probably eliminates the need for cleaning up whitespace. – ThePyGuy Feb 16 '20 at 22:00