I'm currently using a combination of Scrapy and Selenium to quickly search the USPTO TradeMark database. These pages have a session token attached.
The approaches I've tried and read about don't seem integrated enough — while Selenium can pass found URLs to Scrapy, Scrapy then makes a fresh request to that page, which invalidates the session token. So I need Selenium to deliver the rendered HTML directly to Scrapy for parsing. Is this possible?
# -*- coding: utf-8 -*-
# from terminal run: scrapy crawl trademarks -o items.csv -t csv
import time
import scrapy
from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from selenium import webdriver
class TrademarkscrapeItem(scrapy.Item):
    """Container for one scraped trademark record.

    Exported via: scrapy crawl trademarks -o items.csv -t csv
    """

    category = scrapy.Field()
    wordmark = scrapy.Field()
    registrant = scrapy.Field()
    registration_date = scrapy.Field()
    description = scrapy.Field()
class TradeMarkSpider(CrawlSpider):
    """Spider that drives a real browser (Selenium) through the USPTO site,
    then hands the browser-rendered HTML to Scrapy's Selector for parsing.

    This keeps the session token valid: Scrapy never re-requests the page —
    it parses exactly the HTML the Selenium-driven browser is showing.
    """

    name = "trademarks"
    allowed_domains = ["uspto.gov"]
    start_urls = ['http://www.uspto.gov']

    def __init__(self, *args, **kwargs):
        # CrawlSpider does real work in its __init__ (rule compilation etc.);
        # it must be called or the spider is mis-initialized.
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        """Navigate through the site with Selenium, then parse the rendered page.

        Returns a dict (Scrapy accepts plain dicts as items) with the
        'description' text extracted from the final page.
        """
        self.driver.get(response.url)

        # Click through to the page we want; avoid shadowing the builtin `next`.
        link = self.driver.find_element_by_xpath("//*[@id='menu-84852-1']/a")
        link.click()
        time.sleep(2)  # let any js render in page

        link = self.driver.find_element_by_xpath(
            "//*[@id='content']/article/ul[1]/li[1]/article/h4/a")
        link.click()
        time.sleep(2)

        # The answer to the question: feed Selenium's rendered HTML to Scrapy.
        # driver.page_source is the DOM as the browser currently sees it, so
        # the session token is never invalidated by a second HTTP request.
        selector = Selector(text=self.driver.page_source)

        trade_dict = {}
        trade_dict['description'] = selector.xpath(
            "//*[@id='content']/article/div/p/text()").extract()

        # quit() (not close()) ends the whole WebDriver session; close() only
        # closes the current window and would leak the browser process.
        self.driver.quit()
        return trade_dict