0

I'm trying to scrape the information inside an 'iframe' tag. When I execute this code, it says that 'USER_AGENT' is not defined. How can I fix this?

import requests
from bs4 import BeautifulSoup

page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
Senura Dissanayake
j.doe

2 Answers

2

The error tells you exactly what is wrong: you pass USER_AGENT as the headers argument, but you never define it earlier in your code. Take a look at this post on how to use headers with requests.get.

The documentation states you must pass in a dictionary of HTTP headers for the request, whereas you have passed in an undefined variable USER_AGENT.

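From the Requests library API (note that this excerpt documents the headers attribute of a Response object, not the headers argument of requests.get):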

headers = None

Case-insensitive Dictionary of Response Headers.

For example, headers['content-encoding'] will return the value of a 'Content-Encoding' response header.
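As a rough sketch of what goes wrong, the snippet below reproduces the NameError without touching the network. fake_get is a hypothetical stand-in for requests.get (it is not part of Requests), USER_AGENT_MISSING is a deliberately undefined name, and the header value is a placeholder.

```python
def fake_get(url, headers=None, timeout=None):
    """Hypothetical stand-in for requests.get; it only echoes its arguments."""
    return {"url": url, "headers": dict(headers or {}), "timeout": timeout}


def call_with_undefined_name():
    # USER_AGENT_MISSING is never defined, mirroring the code in the question.
    # Python raises NameError while evaluating the argument, before the call.
    try:
        fake_get("https://example.com", headers=USER_AGENT_MISSING, timeout=5)
    except NameError as exc:
        return str(exc)
    return None


# Defining the dictionary before the call makes the same pattern work.
USER_AGENT = {"User-Agent": "Mozilla/5.0 (placeholder value)"}

error_message = call_with_undefined_name()
result = fake_get("https://example.com", headers=USER_AGENT, timeout=5)

print(error_message)
print(result["headers"])
```

The point is that the name is looked up before the function is ever called, so the fix is either to define the dictionary first or to drop the argument.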

EDIT:

For a better explanation of Content-Type headers, see this SO post. See also this WebMasters post which explains the difference between Accept and Content-Type HTTP headers.

Since you only seem to be interested in scraping the iframe tags, you may simply omit the headers argument entirely and you should see the results if you print out the test object in your code.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", timeout=10)
soup = BeautifulSoup(page.content, "lxml")
test = soup.find_all('iframe')

for tag in test:
    print(tag)
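Note that find_all('iframe') gives you the iframe tags themselves, not the document loaded inside them. That document lives at the URL in the tag's src attribute and would need a second request. Here is a minimal sketch of pulling out the src values. It swaps in the standard library's html.parser for BeautifulSoup so it runs without third-party packages, and SAMPLE_HTML and BASE_URL are made-up placeholders, not the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Hypothetical markup; the real page may be structured differently.
SAMPLE_HTML = """
<html><body>
  <iframe id="holders" src="/token/example-frame?a=0x123"></iframe>
  <iframe></iframe>
</body></html>
"""

BASE_URL = "https://example.com/token/0x123"  # placeholder, not the real site

collector = IframeSrcCollector()
collector.feed(SAMPLE_HTML)

# Resolve relative src values against the URL of the page they came from.
frame_urls = [urljoin(BASE_URL, src) for src in collector.sources]
print(frame_urls)
```

With BeautifulSoup the equivalent lookup would be tag.get('src') on each tag in test. Each resolved URL could then be fetched with its own requests.get call.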
Community
Mihai Chelaru
  • if I have to pass headers as {"content-type":"text"}, what would I put in for "content-type" and "text" in my case? – j.doe Apr 24 '18 at 03:28
  • See [this post](https://stackoverflow.com/questions/23714383/what-are-all-the-possible-values-for-http-content-type-header) on Content-Type headers. From what I gather they tell the server what type of response you expect. Since you seem to just be interested in scraping the contents of the 'iframe' tag you can just omit the `headers` argument. I've edited my response to reflect this. – Mihai Chelaru Apr 24 '18 at 03:48
  • @MihaiChelaru you're just overkilling it with the explanations. If the OP knew how to read documentation he probably wouldn't be on StackOverflow, spare him the details and just tell him what's wrong with his actual code. i.e. `USER_AGENT` is a variable and he's missing it. He can either remove it as it's not necessary or add a fake user agent. Thank you for your contributions @Mihai !! – innicoder Apr 24 '18 at 05:59
  • @Elivir Thanks for the feedback. I will try to be more concise and to the point in my answers. I'm still new on here so I'm still learning what makes a good answer. – Mihai Chelaru Apr 24 '18 at 13:11
  • @MihaiChelaru Exactly why I was only advising, you're doing great and I do appreciate your participation in helping people with their issues. Perhaps your knowledge is more advanced and this subject seems like an easy matter to you, that's why you have to simplify it as much as you can. Mostly short answers and a LINK (and/or an example) will suffice. Take care! – innicoder Apr 24 '18 at 13:20
1

USER_AGENT has to be defined before you can pass it as headers. HERE's a link to a list of fake user agents.

import requests
from bs4 import BeautifulSoup


USER_AGENT = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"


page = requests.get(url + token, headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')

Alternatively, you can simply omit the user agent:

import requests
from bs4 import BeautifulSoup


url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"


page = requests.get(url + token, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')

For readability, I've split your URL into two variables, url and token.
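One thing worth checking about the concatenated address: the first part ends in #balances, and everything after a # is the URL fragment, which HTTP clients do not send to the server. The sketch below uses the standard library's urlsplit to show how the joined string is parsed. The candidate URL at the end is only an assumption about what was intended; I have not verified that it serves the holder table.

```python
from urllib.parse import urlsplit

url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"

# Parse the joined string the same way the code in this answer builds it.
parts = urlsplit(url + token)

print("path:    ", parts.path)      # what the server is asked for
print("query:   ", repr(parts.query))
print("fragment:", parts.fragment)  # stays on the client side

# Assumption: the second part was meant to be its own path on the same host.
base = urlsplit(url)
candidate = f"{base.scheme}://{base.netloc}{token}"
print("candidate:", candidate)
```

If the parsed path is only the token page, the request never reaches the generic-tokenholders2 address, which would explain why you get the outer page with its iframe tag instead of the iframe's contents.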

innicoder