How do I extract text from between TD tags using BeautifulSoup and requests in Python 2.7?

Question

I'm trying to extract the text between TD tags using BeautifulSoup and requests in Python 2.7. So far, with this code, I get nothing :(

import requests
from bs4 import BeautifulSoup

# Set up the Spider

def card_search(max_pages):
    page = 1
    mtgset = 'portal'
    card = 'lava-axe'

    while page <= max_pages:
        url = 'http://www.mtgotraders.com/store/search-results.html?q=lava+axe&x=0&y=0'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)

        for text in soup.findAll('td',{'class': 'price mod'}):
            pagetext = text.get('td')

            print(pagetext)
            page += 1

card_search(1)

I'm trying to automate sorting and valuing my MTG card collection, so the results from the site used in the code example are pretty important. I know the site can be parsed because I got it to return links. Sadly, I just can't get it to return plain text.

Here is the code I used to pull links. It isn't directed at the table, just the page itself.

import requests
from bs4 import BeautifulSoup

# Set up the Spider

def card_search(max_pages):
    page = 1
    mtgset = 'portal'
    card = 'lava-axe'

    while page <= max_pages:
        url = 'http://www.mtgotraders.com/store/search-results.html?q=lava+axe&x=0&y=0'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)

        for text in soup.findAll('a'):
            pagetext = text.get('href')

            print(pagetext)
            page += 1

card_search(1)

Kind Regards, Sour Jack

When I fetch that URL, I don't get anything with a table in it at all. I don't know whether the URL is wrong (I notice you're not using the `mtgset` and `page` variables anywhere in the URL…), you're supposed to be doing a POST with some information in the body instead of a GET, you're running into deep-linking protection or similar, or the table is generated by JavaScript rather than statically part of the page. But regardless, if the table isn't there, you can't parse the table. — abarnert, Apr 20 '15 at 03:41
And this has nothing to do with your Python code; if I just pass the same URL to curl or wget, I get the same page with no table in it. So, if you actually got this to work for pulling links out of the table, you'll need to post the code that actually does that, not code that doesn't. — abarnert, Apr 20 '15 at 03:44
abarnert, correct: the mtgset and card variables are not being used. page is used to count in the while loop. If the table is being generated by JavaScript or the like, is there any other way to pull the data? — Sour Jack, Apr 20 '15 at 03:45
abarnert, I edited the original post to show the code that pulls links from the page. The code that pulls the links is not aimed at the table, though, just the page itself. — Sour Jack, Apr 20 '15 at 03:50
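The check described in the comments above can be sketched as follows. This is only an illustration: the sample markup is invented, the helper name `price_cell_texts` is hypothetical, and the only detail taken from the question is the `price mod` class. It also shows that `Tag.get()` looks up an HTML attribute, so `text.get('td')` returns None even when the cell exists, while `get_text()` returns the text between the tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><td class="name">Lava Axe</td><td class="price mod"> $0.05 </td></tr>
  <tr><td class="name">Lava Axe (foil)</td><td class="price mod">$0.25</td></tr>
</table>
"""


def price_cell_texts(html):
    """Return the text inside every <td class="price mod"> found in html."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td", {"class": "price mod"})
    # cell.get("td") would look for an attribute named "td" and return None;
    # get_text() returns the text between the opening and closing tags.
    return [cell.get_text(strip=True) for cell in cells]


if __name__ == "__main__":
    print(price_cell_texts(SAMPLE_HTML))
    # An empty list means the cells are not in the HTML that was parsed.
    print(price_cell_texts("<html><body><p>No table here</p></body></html>"))
```

If the same helper returns an empty list for `requests.get(url).text`, the cells are probably not in the HTML the server sends, which matches what curl and wget showed.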

score 0 · Answer 1 · edited May 23 '17 at 11:50


If you want more flexibility in your scraping, you will need something like PhantomJS, a headless browser that can run the page's JavaScript. Look at Pykler's answer here.
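A rough sketch of that approach is below, with some loud assumptions. It uses Selenium with headless Chrome rather than PhantomJS, because newer Selenium releases no longer ship a PhantomJS driver. The function names `headless_chrome` and `fetch_rendered_html` are hypothetical, it assumes Selenium and a matching Chrome driver are installed, and it has not been run against the site in the question.

```python
def headless_chrome():
    """Start a headless Chrome session (assumes Selenium and a Chrome driver)."""
    # Imported lazily so the rest of the module loads without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, make_driver=headless_chrome):
    """Return the page's HTML after the browser has run its scripts."""
    driver = make_driver()
    try:
        driver.get(url)
        # page_source reflects the DOM at this moment; content that loads
        # late may still be missing and might need an explicit wait.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()
```

The returned string can be passed to `BeautifulSoup` in place of `source_code.text` from the question. The `make_driver` parameter is there only so a different browser, or a stand-in object, can be swapped in.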

answered Apr 20 '15 at 04:04 by kpie


Thank you to everyone who responded. PhantomJS with Selenium looks like the ticket. – Sour Jack Apr 21 '15 at 02:14
