Data-Structure conversion of Fields for csv output

Question

I'm still new to this and I'm wondering whether there is an easier way to separate text. Right now I'm working in Excel and have multiple values in one cell, and separating them by hand is no fun. My data, a class with three fields (scrapy.Field()), looks like this (each A can have multiple B; each B has 7 C):

A, “B1,B2”, “C1,C2,C3,…, C14”

And I'd like to save it like this, one row per combination:

A, B1, C1

A, B1, C2

…

A, B1, C7

A, B2, …

This is my code:

class Heroes1Item(scrapy.Item):
    hero_name = scrapy.Field()
    hero_builds = scrapy.Field()
    hero_buildskills = scrapy.Field()

and

import scrapy
from heroes1.items import Heroes1Item
from scrapy import Request, Item, Field

class Heroes1JobSpider(scrapy.Spider):
    name = 'heroes1_job'
    allowed_domains = ['icy-veins.com']
    start_urls = ['https://www.icy-veins.com/heroes/assassin-hero-guides']

    def parse(self, response):
        heroes_xpath = '//div[@class="nav_content_block_entry_heroes_hero"]/a/@href'
        for link in response.xpath(heroes_xpath).extract():
            yield Request(response.urljoin(link), self.parse_hero)

    def parse_hero(self, response):
        hero_names = response.xpath('//span[@class="page_breadcrumbs_item"]/text()').extract()
        hero_buildss = response.xpath('//h3[@class="toc_no_parsing"]/text()').extract()
        hero_buildskillss = response.xpath('//span[@class="heroes_build_talent_tier_visual"]').extract()

        for item in zip(hero_names, hero_buildss, hero_buildskillss):
            new_item = Heroes1Item()
            new_item['hero_name'] = item[0]
            #new_item['hero_builds'] = item[1]    DATALOSS
            #new_item['hero_buildskills'] = item[2]    DATALOSS
            new_item['hero_builds'] = response.xpath('//h3[@class="toc_no_parsing"]/text()').extract()
            new_item['hero_buildskills'] = response.xpath('//span[@class="heroes_build_talent_tier_visual"]').extract()
            yield new_item

Thanks for your help and any ideas!

I've provided an answer based on your question. However, if you included the URL of the page you are trying to parse, people could provide a simpler answer based on an XPath expression that does not rely on there always being 7 skills per build. – Gallaecio, Feb 06 '19 at 13:34

Wim Hermans · Answer 1 · 2019-02-08T07:20:19.457

I think the problem lies in this part: zip(hero_names, hero_buildss, hero_buildskillss). If I understand correctly, you want the Cartesian product of the 3 lists, which you can get like this:

import itertools 

hero_lists = [hero_names, hero_buildss, hero_buildskillss]
for item in itertools.product(*hero_lists):
    new_item = Heroes1Item()
    new_item['hero_name'] = item[0]
    new_item['hero_builds'] = item[1]
    new_item['hero_buildskills'] = item[2]
    yield new_item
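To see what the product yields, here is a minimal sketch on plain lists. The values A, B1, C1 and so on are placeholders borrowed from the question, and the "two skills per build" grouping is only assumed for illustration:

```python
import itertools

# Placeholder data: one hero, two builds, four skills.
# Assume C1-C2 belong to B1 and C3-C4 belong to B2.
hero_names = ["A"]
builds = ["B1", "B2"]
skills = ["C1", "C2", "C3", "C4"]

# product() pairs every element of each list with every element of the others.
rows = list(itertools.product(hero_names, builds, skills))

for row in rows:
    print(row)
```

Every build is paired with every skill here, so this only fits if the lists are truly independent of each other.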

If there is a dependency between hero_buildss and hero_buildskillss (each build has its own skills), the code below might work better:

hero_names = response.xpath('//span[@class="page_breadcrumbs_item"]/text()').extract()
hero_builds_xpath = response.xpath('//*[@class="heroes_build"]')
for hero_build_xpath in hero_builds_xpath:
    hero_buildss = hero_build_xpath.xpath('.//h3[@class="toc_no_parsing"]/text()').extract()
    hero_buildskillss = hero_build_xpath.xpath('.//span[@class="heroes_build_talent_tier_visual"]').extract()
    new_item = Heroes1Item()
    new_item['hero_name'] = hero_names
    new_item['hero_builds'] = hero_buildss
    new_item['hero_buildskills'] = hero_buildskillss
    yield new_item
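The idea of querying relative to each build container can be tried without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors; the markup is invented and much simpler than the real page, and rows_per_build is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup; the real page structure may differ.
PAGE = """
<body>
  <div class="heroes_build">
    <h3 class="toc_no_parsing">B1</h3>
    <span class="heroes_build_talent_tier_visual">C1</span>
    <span class="heroes_build_talent_tier_visual">C2</span>
  </div>
  <div class="heroes_build">
    <h3 class="toc_no_parsing">B2</h3>
    <span class="heroes_build_talent_tier_visual">C3</span>
  </div>
</body>
"""


def rows_per_build(page, hero_name):
    """Return one (hero, build, skill) tuple per skill, grouped by build."""
    root = ET.fromstring(page)
    rows = []
    # Select each build container first ...
    for build in root.findall(".//div[@class='heroes_build']"):
        # ... then query relative to it, so skills stay with their build.
        title = build.find(".//h3[@class='toc_no_parsing']").text
        for skill in build.findall(".//span[@class='heroes_build_talent_tier_visual']"):
            rows.append((hero_name, title, skill.text))
    return rows


rows = rows_per_build(PAGE, "A")
```

Because the skills are looked up inside each container, this does not depend on a fixed number of skills per build.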

Thanks, that seems to be the right way! But it kind of over-iterates, because it's missing the dependency between column B and column C. If column B (build) has 2 entries, then there will be 14 entries in column C (buildskills), so B1 gets C1-C7 and B2 gets C8-C14. With three builds, B3 gets C15-C21, and so on. With your code each B gets every C instead of its own segment of 7. I'll try to understand itertools in order to fix this. If you have further ideas, I'd appreciate them very much. Thank you so far! – Tribic, Feb 06 '19 at 02:32

Gallaecio · Accepted Answer · 2019-02-06T13:55:14.180

You can use a function that splits the build skills into chunks (like chunks() here) and do something along the lines of:

for item in zip(hero_names, hero_buildss, hero_buildskillss):
    builds = response.xpath('//h3[@class="toc_no_parsing"]/text()').extract()
    skills = response.xpath('//span[@class="heroes_build_talent_tier_visual"]').extract()
    skill_chunks = chunks(skills, 7)
    for build, skill_chunk in zip(builds, skill_chunks):
        for skill in skill_chunk:
            new_item = Heroes1Item()
            new_item['hero_name'] = item[0]
            new_item['hero_build'] = build
            new_item['hero_buildskill'] = skill
            yield new_item
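For reference, here is a self-contained sketch of the same chunking idea on plain lists, without Scrapy. The chunks() helper is the common slicing recipe, while expand() and the sample values are hypothetical names made up for this illustration. It assumes exactly 7 skills per build, as stated in the question:

```python
import csv
import io


def chunks(seq, n):
    """Yield successive slices of seq, each at most n items long."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


def expand(hero_name, builds, skills, per_build=7):
    """Yield one row per (hero, build, skill), assuming per_build skills per build."""
    for build, skill_chunk in zip(builds, chunks(skills, per_build)):
        for skill in skill_chunk:
            yield {
                "hero_name": hero_name,
                "hero_builds": build,
                "hero_buildskills": skill,
            }


# Placeholder data shaped like the question: 2 builds, 14 skills.
builds = ["B1", "B2"]
skills = ["C%d" % i for i in range(1, 15)]
rows = list(expand("A", builds, skills))

# Write the rows as CSV into an in-memory buffer.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["hero_name", "hero_builds", "hero_buildskills"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

The dictionary keys match the field names declared on Heroes1Item in the question, so the same loop body could fill an item instead of a dict.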

edited Feb 06 '19 at 13:55

answered Feb 06 '19 at 13:32


Hey, and thanks. I added this code: `def chunks(l, n): for i in range(0, len(l), n): yield l[i:i + n]`, but it complains about the arguments of chunks: 2 accepted, 3 given. I haven't figured out yet where the third is coming from. – Tribic Feb 07 '19 at 17:08
Correction: with the above code, it returns *NameError: global name 'chunks' is not defined*. I added **self**.chunks(skills, 7), and then it returns the too-many-arguments error. – Tribic Feb 07 '19 at 17:32
Remove the `self.` part from the call (I assume you defined the function outside the Spider class, which I believe is the best approach, but you must remove the `self.` part because of it). – Gallaecio Feb 08 '19 at 11:35
Of course I defined it inside the spider class. :) I put it outside and added two "s" in 'hero_build*s*' and 'hero_buildskill*s*', and now everything is perfect. Thank you very much!! – Tribic Feb 08 '19 at 13:14
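
The "too many arguments" error from the comments can be reproduced in isolation. This is a rough sketch with hypothetical class names (BrokenSpider, WorkingSpider) and no Scrapy involved. A function defined inside a class and called through self receives the instance as an extra first argument:

```python
def chunks(seq, n):
    """Module-level helper: yield successive slices of seq, at most n items each."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


class BrokenSpider:
    # Defined inside the class but without a `self` parameter.
    def chunks(seq, n):
        for i in range(0, len(seq), n):
            yield seq[i:i + n]

    def run(self, skills):
        # self.chunks(skills, 7) passes three arguments: self, skills and 7.
        return list(self.chunks(skills, 7))


class WorkingSpider:
    def run(self, skills):
        # Calls the module-level function, so only two arguments are passed.
        return list(chunks(skills, 7))


try:
    BrokenSpider().run(list(range(14)))
    error = None
except TypeError as exc:
    error = str(exc)

result = WorkingSpider().run(list(range(14)))
```

Keeping the helper at module level, as suggested above, avoids the problem; marking it with @staticmethod inside the class would be another option.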
