
I'm still new to this and I'm wondering whether there is an easier way to separate text. Right now I'm working in Excel and have multiple values in one cell. Separating them by hand is no fun. My data, an item class with three fields, looks like this (each A can have multiple B; each B has 7 C):

A, “B1,B2”, “C1,C2,C3,…, C14”

And I'd like to fill/save it like this:

A, B1, C1

A, B1, C2

A, B1, C7

A, B2, …

This is my code:

class Heroes1Item(scrapy.Item):
    hero_name = scrapy.Field()
    hero_builds = scrapy.Field()
    hero_buildskills = scrapy.Field()

and

import scrapy
from heroes1.items import Heroes1Item
from scrapy import Request, Item, Field

class Heroes1JobSpider(scrapy.Spider):
    name = 'heroes1_job'
    allowed_domains = ['icy-veins.com']
    start_urls = ['https://www.icy-veins.com/heroes/assassin-hero-guides']

    def parse(self, response):
        heroes_xpath = '//div[@class="nav_content_block_entry_heroes_hero"]/a/@href'
        for link in response.xpath(heroes_xpath).extract():
            yield Request(response.urljoin(link), self.parse_hero)

    def parse_hero(self, response):
        hero_names = response.xpath('//span[@class="page_breadcrumbs_item"]/text()').extract()
        hero_buildss = response.xpath('//h3[@class="toc_no_parsing"]/text()').extract()
        hero_buildskillss = response.xpath('//span[@class="heroes_build_talent_tier_visual"]').extract()

        for item in zip(hero_names, hero_buildss, hero_buildskillss):
            new_item = Heroes1Item()
            new_item['hero_name'] = item[0]
            #new_item['hero_builds'] = item[1]    DATALOSS
            #new_item['hero_buildskills'] = item[2]    DATALOSS
            new_item['hero_builds'] = response.xpath('//h3[@class="toc_no_parsing"]/text()').extract()
            new_item['hero_buildskills'] = response.xpath('//span[@class="heroes_build_talent_tier_visual"]').extract()
            yield new_item

Thanks for your help and any ideas!

stranac
Tribic
  • I've provided an answer based on your question. However, if you included the URL that you are trying to parse, people could provide a simpler answer based on an XPath expression that does not rely on there always being 7 skills per build. – Gallaecio Feb 06 '19 at 13:34

2 Answers


I think the problem lies in this part: zip(hero_names, hero_buildss, hero_buildskillss). If I understand correctly, you want the Cartesian product of the three lists, which you can build like this:

import itertools 

hero_lists = [hero_names, hero_buildss, hero_buildskillss]
for item in itertools.product(*hero_lists):
    new_item = Heroes1Item()
    new_item['hero_name'] = item[0]
    new_item['hero_builds'] = item[1]
    new_item['hero_buildskills'] = item[2]
    yield new_item
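To see what the Cartesian product does on small inputs, here is a minimal sketch with plain lists standing in for the extracted values. The values `A`, `B1`, `C1` and so on are the placeholders from the question, not real scraped data. Note that every build is paired with every skill, which may be more rows than wanted if each skill belongs to one specific build.

```python
import itertools

# Placeholder values standing in for the lists returned by .extract().
hero_names = ["A"]
hero_builds = ["B1", "B2"]
hero_buildskills = ["C1", "C2", "C3", "C4"]

# product() yields one tuple for every possible (name, build, skill) combination.
rows = list(itertools.product(hero_names, hero_builds, hero_buildskills))
for row in rows:
    print(row)
```

With 1 name, 2 builds and 4 skills this produces 1 x 2 x 4 = 8 rows.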

If there is a dependency between hero_buildss and hero_buildskillss, the approach below might work better:

hero_names = response.xpath('//span[@class="page_breadcrumbs_item"]/text()').extract()
hero_builds_xpath = response.xpath('//*[@class="heroes_build"]')
for hero_build_xpath in hero_builds_xpath:
    hero_buildss = hero_build_xpath.xpath('.//h3[@class="toc_no_parsing"]/text()').extract()
    hero_buildskillss = hero_build_xpath.xpath('.//span[@class="heroes_build_talent_tier_visual"]').extract()
    new_item = Heroes1Item()
    new_item['hero_name'] = hero_names
    new_item['hero_builds'] = hero_buildss
    new_item['hero_buildskills'] = hero_buildskillss
    yield new_item
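The idea of scoping the skill lookup to each build can be tried without Scrapy. The sketch below swaps in the standard library's `xml.etree.ElementTree`, which supports a small XPath subset, and runs against hypothetical, simplified markup; the real page structure (including the `heroes_build` class) is an assumption and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page may be structured differently.
markup = """
<div>
  <div class="heroes_build">
    <h3 class="toc_no_parsing">B1</h3>
    <span class="heroes_build_talent_tier_visual">C1</span>
    <span class="heroes_build_talent_tier_visual">C2</span>
  </div>
  <div class="heroes_build">
    <h3 class="toc_no_parsing">B2</h3>
    <span class="heroes_build_talent_tier_visual">C3</span>
    <span class="heroes_build_talent_tier_visual">C4</span>
  </div>
</div>
""".strip()

root = ET.fromstring(markup)
rows = []
for build in root.findall(".//*[@class='heroes_build']"):
    # Paths starting with "./" are relative, so they only match inside this build.
    build_name = build.findtext("./h3[@class='toc_no_parsing']")
    for skill in build.findall("./span[@class='heroes_build_talent_tier_visual']"):
        rows.append(("A", build_name, skill.text))
```

Because each skill is looked up relative to its own build element, no fixed number of skills per build has to be assumed.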
Wim Hermans
  • Thanks, that seems to be the right way! But it over-iterates, because it's missing the dependency between column B and column C. If column B (builds) has 2 entries, then there will be 14 entries in column C (build skills), so B1 gets C1-C7 and B2 gets C8-C14. With three builds, B3 gets C15-C21, and so on. With your code each B gets every C instead of its own block of 7. I'll try to understand itertools in order to fix this. If you have further ideas, I'd appreciate them very much. Thank you so far! – Tribic Feb 06 '19 at 02:32
  • Maybe this approach works better (updated in the answer) – Wim Hermans Feb 08 '19 at 07:18

You can use a function that splits the build skills into chunks (such as a chunks() helper) and do something along the lines of:

for item in zip(hero_names, hero_buildss, hero_buildskillss):
    builds = response.xpath('//h3[@class="toc_no_parsing"]/text()').extract()
    skills = response.xpath('//span[@class="heroes_build_talent_tier_visual"]').extract()
    # Split the flat skill list into blocks of 7, one block per build.
    skill_chunks = chunks(skills, 7)
    for build, skill_chunk in zip(builds, skill_chunks):
        for skill in skill_chunk:
            new_item = Heroes1Item()
            new_item['hero_name'] = item[0]
            # Keys must match the fields declared on Heroes1Item.
            new_item['hero_builds'] = build
            new_item['hero_buildskills'] = skill
            yield new_item
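As a rough, self-contained sketch of this idea, the snippet below uses plain lists in place of the XPath results and assumes, as in the question, that every build has exactly 7 skills. The `chunks()` helper and the `expand_rows()` function are illustrative names, not part of Scrapy.

```python
def chunks(items, size):
    """Yield successive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def expand_rows(hero_name, builds, skills, skills_per_build=7):
    """Pair each build with its own block of skills, one row per skill."""
    rows = []
    for build, skill_chunk in zip(builds, chunks(skills, skills_per_build)):
        for skill in skill_chunk:
            rows.append((hero_name, build, skill))
    return rows


# Placeholder data from the question: one hero, two builds, 14 skills.
builds = ["B1", "B2"]
skills = ["C%d" % i for i in range(1, 15)]  # C1 .. C14
rows = expand_rows("A", builds, skills)
```

Here B1 is paired with C1 to C7 and B2 with C8 to C14, giving 14 rows in total. If a page ever lists a different number of skills per build, the fixed chunk size would pair skills with the wrong build.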
Gallaecio
  • Hey, and thanks. I added this code: `def chunks(l, n): for i in range(0, len(l), n): yield l[i:i + n]`, but it complains about the arguments of chunks: 2 accepted, 3 given. I haven't figured out yet where the third one is coming from. – Tribic Feb 07 '19 at 17:08
  • Correction. With the above code, it returns *NameError: global name 'chunks' is not defined*. I added **self**.chunks(skills, 7) ... then it returns the too-many-arguments error. – Tribic Feb 07 '19 at 17:32
  • Remove the `self.` part from the call (I assume you defined the function outside the Spider class, which I believe is the best approach, but you must remove the `self.` part because of it). – Gallaecio Feb 08 '19 at 11:35
  • Of course I defined it inside the spider class. :) I put it outside and added two "s" in 'hero_build*s*' and 'hero_buildskill*s*' - now everything is perfect. Thank you very much!! – Tribic Feb 08 '19 at 13:14
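The argument-count error discussed in these comments can be reproduced without Scrapy. In this sketch (the class names are hypothetical), a helper defined inside a class without a `self` parameter receives the instance as an extra first argument when called as `self.chunks(...)`, whereas a module-level function is called with just its own two arguments.

```python
def chunks(items, size):
    """Module-level helper: call it as chunks(...), without `self.`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class SpiderWithModuleHelper:
    def parse(self, skills):
        # Resolves to the module-level function defined above.
        return list(chunks(skills, 2))


class SpiderWithMisdefinedHelper:
    # Defined inside the class, but without a `self` parameter.
    def chunks(items, size):
        for start in range(0, len(items), size):
            yield items[start:start + size]

    def parse(self, skills):
        # Python passes the instance as the first argument here, so the
        # function receives three arguments instead of two.
        return list(self.chunks(skills, 2))


good = SpiderWithModuleHelper().parse(["C1", "C2", "C3"])

try:
    SpiderWithMisdefinedHelper().parse(["C1", "C2", "C3"])
    error = None
except TypeError as exc:
    error = exc
```

Keeping the helper at module level, as suggested above, avoids the problem; declaring it inside the class with a leading `self` parameter would be another option.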