I want to scrape the website linked below. I need several parameters from it, and the approach below is the best one I have found so far. However, I need to scrape two different parts of the page, and I don't know how to combine them properly (as columns of the same row), so I need your help. I am also open to a better solution. I also need to skip some rows that are scraped incorrectly, and I don't want to add null rows. The output is shared as a file: http://s7.dosya.tc/server14/tnx4u0/test.json.zip.html

In fact, the table loop needs to be inside the base loop, but I wrote it like this for now to show it more clearly. Thanks a lot.

from scrapy import Spider


class KingsatSpider(Spider):
    name = 'kingsat'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['tr.kingofsat.net']
    start_urls = ['https://tr.kingofsat.net/tvsat-turksat4a.php']

    def parse(self, response):
        # channel rows and header ("base") rows, both selected page-wide
        tables = response.xpath('//*[@class="fl"]/tr')
        bases = response.xpath('//table[@class="frq"]/tr')

        for base in bases:
            yield {
                'Frekans': base.xpath('.//td[3]/text()').extract_first(),
                'Polarizasyon': base.xpath('.//td[4]/text()').extract_first(),
                'Kapsam': base.xpath('.//td[6]/a/text()').extract_first(),
                'SR': base.xpath('.//td[9]/a[1]/text()').extract_first(),
                'FEC': base.xpath('.//td[9]/a[2]/text()').extract_first(),
            }

            for table in tables:
                yield {
                    'channel': table.xpath('.//td[3]/a/text()').extract_first(),
                    'V-PID': table.xpath('.//td[9]/text()[1]').extract_first(),
                    'A-PID': table.xpath('.//td[10]/text()[1]').extract_first(),
                }
  • Can you extend your question with a description of the output you need? – vezunchik Apr 08 '19 at 14:10
  • How are they related? Maybe you should put the table loop inside the base loop and yield a single item with all fields. But I think you should scrape the page differently, so that you keep the relation between each base and the rows in its table. – furas Apr 08 '19 at 14:18

2 Answers

The page has this structure:

  • base (header)
  • table with many rows
  • base (header)
  • table with many rows

etc.

You get all headers in bases and all rows in tables as separate, unrelated lists. Instead, you have to select each table as a single element, so that you can build (base, table) pairs, and then get the rows from each table and yield them together with the matching base.


In the XPath below I select the tables without the trailing tr, so I can create pairs (base, table-with-all-its-rows).

Then I can get the rows from each table and yield them with its base.

I couldn't test it. You may have to skip the first base: zip(bases[1:], tables)

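    # one header row per base, one whole table (not its rows) per channel list
    bases = response.xpath('//table[@class="frq"]/tr')
    tables = response.xpath('//*[@class="fl"]')

    # pair every base with the table that follows it
    for base, table in zip(bases, tables):
        rows = table.xpath('.//tr')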
        for row in rows:
            yield {
                'Frekans':      base.xpath('.//td[3]/text()').extract_first(),
                'Polarizasyon': base.xpath('.//td[4]/text()').extract_first(),
                'Kapsam':       base.xpath('.//td[6]/a/text()').extract_first(),
                'SR':           base.xpath('.//td[9]/a[1]/text()').extract_first(),
                'FEC':          base.xpath('.//td[9]/a[2]/text()').extract_first(),
                'channel' :     row.xpath('.//td[3]/a/text()').extract_first(),
                'V-PID' :       row.xpath('.//td[9]/text()[1]').extract_first(),
                'A-PID' :       row.xpath('.//td[10]/text()[1]').extract_first(),
            }
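
The pairing idea can be tried without running a crawl. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, and the PAGE markup, its column positions, and the pair_rows helper are all made up for illustration (the real page has more columns and different positions). It also skips rows with no channel name, which is one possible way to avoid the null rows mentioned in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the real page: every "frq" header
# table is followed by one "fl" table with that transponder's channels.
PAGE = """
<html><body>
  <table class="frq"><tr><td>11012.00</td><td>V</td></tr></table>
  <table class="fl">
    <tr><td>Channel A</td><td>101</td></tr>
    <tr><td></td><td>999</td></tr>
    <tr><td>Channel B</td><td>102</td></tr>
  </table>
  <table class="frq"><tr><td>11054.00</td><td>H</td></tr></table>
  <table class="fl">
    <tr><td>Channel C</td><td>201</td></tr>
  </table>
</body></html>
"""


def pair_rows(page):
    """Yield one dict per channel row, merged with the fields of its header."""
    root = ET.fromstring(page.strip())
    bases = root.findall(".//table[@class='frq']/tr")
    tables = root.findall(".//table[@class='fl']")  # whole tables, not rows

    for base, table in zip(bases, tables):
        for row in table.findall("./tr"):
            channel = row.findtext("td[1]")
            if not channel:
                # skip rows without a channel name instead of emitting nulls
                continue
            yield {
                "Frekans": base.findtext("td[1]"),
                "Polarizasyon": base.findtext("td[2]"),
                "channel": channel,
                "V-PID": row.findtext("td[2]"),
            }


items = list(pair_rows(PAGE))
for item in items:
    print(item)
```

Each printed dict carries both the header fields and the channel fields, so every channel stays attached to the frequency it belongs to.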
furas
  • You are a hero, furas :) Thanks a lot. I also have two short questions: can I easily change the order of the columns? And the output does not show my language's characters correctly (UTF-8); can I fix that easily somehow? –  Apr 08 '19 at 14:56
  • You could change the order of the fields in the yield, but I don't know whether Scrapy will respect it. If you export to CSV, then maybe see [CSV Exports - Ordering of columns using scrapy crawl -o output.csv](https://stackoverflow.com/questions/28368912/csv-exports-ordering-of-columns-using-scrapy-crawl-o-output-csv). – furas Apr 08 '19 at 15:11
  • I also found [Order a json by field using scrapy](https://stackoverflow.com/questions/48827688/order-a-json-by-field-using-scrapy) – furas Apr 08 '19 at 15:13
  • I don't know what you want to do with UTF-8. Do you want to use a different encoding? I don't know whether you can do that in Scrapy; I have never needed to change the encoding. If I couldn't change it in Scrapy, I would use [iconv](https://en.wikipedia.org/wiki/Iconv) (on Linux) – furas Apr 08 '19 at 15:18
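
For the two follow-up questions, Scrapy has feed-export settings that may help. FEED_EXPORT_FIELDS lists the fields to export and their order, and FEED_EXPORT_ENCODING set to utf-8 makes the JSON feed write non-ASCII characters directly instead of as \uXXXX escapes. The sketch below is a settings fragment only; the field order is an arbitrary example, and how each exporter treats the field list may depend on the Scrapy version, so treat it as a starting point.

```python
# Sketch of feed-export settings; place them in the project's settings.py
# or inside the spider's custom_settings dict.

# Write the feed as UTF-8 so Turkish characters are not escaped in JSON.
FEED_EXPORT_ENCODING = "utf-8"

# Export these fields in this order (the order is just an example).
FEED_EXPORT_FIELDS = [
    "channel",
    "V-PID",
    "A-PID",
    "Frekans",
    "Polarizasyon",
    "Kapsam",
    "SR",
    "FEC",
]
```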

If the tables are related to the bases, you don't need to split them into two parts at all; that is the best way to solve it. If they are not related to each other and there are the same number of each, you can use the following method.

def parse(self, response):
    tables = response.xpath('//*[@class="fl"]/tr')
    bases = response.xpath('//table[@class="frq"]/tr')
    # pair the i-th base with the i-th table row (assumes equal counts)
    for i in range(len(bases)):
        yield {
            'Frekans': bases[i].xpath('.//td[3]/text()').extract_first(),
            'A-PID': tables[i].xpath('.//td[10]/text()[1]').extract_first(),
        }

If the counts are not the same, you can only treat them as a whole piece and then deal with it in a pipeline.
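
As one illustration of doing cleanup in a pipeline, the sketch below drops items that lack required fields, which covers the question's wish to avoid null rows. The class name and the choice of required fields are hypothetical. process_item and DropItem are the standard Scrapy pipeline hook and exception; the fallback class exists only so the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so this sketch runs without Scrapy installed."""


class DropIncompleteRowsPipeline:
    """Discard items whose required fields are missing or empty."""

    # Hypothetical choice: a row is useful only with a channel and a frequency.
    required = ("channel", "Frekans")

    def process_item(self, item, spider):
        missing = [key for key in self.required if not item.get(key)]
        if missing:
            # Scrapy logs the reason and does not export the item
            raise DropItem(f"missing fields: {missing}")
        return item
```

A pipeline like this is enabled through the ITEM_PIPELINES setting, keyed by the class's import path in your project.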

Tom.chen.kang
  • Each base belongs to one table, and the table holds the values for that base. Think of the base as the table's header. I need to scrape each base together with the values of its table. There should be over 200 channels, I guess. I hope you guys can help me with a solution. –  Apr 08 '19 at 14:35
  • @Emre since they are related, why divide them into two parts? I just looked at the start_url, but the data you want to crawl doesn't correspond, so I cannot give a detailed solution. – Tom.chen.kang Apr 08 '19 at 14:41