How to bypass Recaptcha for BeautifulSoup in Python?

Question

I want to get data from https://www.example.com using BeautifulSoup (BS4) as follows:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.example.com/')
soup = BeautifulSoup(req.text, "lxml")
cDF = soup.find('div',attrs={"id" : "nav-tabContent"}).find(
    'table',attrs={"id" : "main_table_countries_today"}).find_all('tr')

I get this error:

cDF = soup.find('div',attrs={"id" : "nav-tabContent"}).find(
AttributeError: 'NoneType' object has no attribute 'find'

When I inspected soup, I found that it contains Cloudflare's reCAPTCHA challenge page rather than the page I requested, which is why soup.find returns None.
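The failure mode described above can be made explicit by checking each lookup for None before chaining the next call. The following is only a minimal sketch: the helper names, the marker strings in CHALLENGE_MARKERS, and the use of the built-in "html.parser" (instead of "lxml") are assumptions made for illustration, and a real challenge page may be worded differently.

```python
from bs4 import BeautifulSoup

# Hypothetical marker strings; a real challenge or block page may differ.
CHALLENGE_MARKERS = ("captcha", "error code: 1010")


def looks_like_challenge(html):
    """Heuristically guess whether the HTML is a challenge/block page."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def find_country_rows(html):
    """Return the <tr> rows of the target table, or raise a descriptive error."""
    soup = BeautifulSoup(html, "html.parser")

    # soup.find returns None when nothing matches, so check before chaining.
    container = soup.find("div", attrs={"id": "nav-tabContent"})
    if container is None:
        if looks_like_challenge(html):
            reason = "the response looks like a challenge page"
        else:
            reason = "the page layout is not what was expected"
        raise RuntimeError("div#nav-tabContent not found: " + reason)

    table = container.find("table", attrs={"id": "main_table_countries_today"})
    if table is None:
        raise RuntimeError("table#main_table_countries_today not found")

    return table.find_all("tr")
```

Calling find_country_rows(req.text) in place of the chained find calls would then raise a RuntimeError that names the likely cause instead of the AttributeError on NoneType.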

I checked other similar questions, but most have no answers. One has been answered (2 replies), but it concerns a particular bot test that is not relevant to my case. Therefore, I believe this question should not be marked as a duplicate.

Please tell me how I can get the data for my analysis while bypassing the reCAPTCHA. By the way, I use Privacy Pass in Google Chrome on Ubuntu. Thanks.

I see a downvote without a comment. If I knew the reason(s), I could correct my post. If I knew everything, I would not need to come here and post my question. A learned person can teach without insulting others for no reason. This is not an exam testbed. – vega, Jun 03 '20 at 20:05

score 1 · Answer 1 · answered Jun 03 '20 at 20:04

Try changing the User-Agent header. For example, the request works fine with curl, so there aren't any advanced protections in place.
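As a rough sketch of what changing the header could look like with requests: the User-Agent string below is only an assumed example value, and make_session and fetch_html are hypothetical helper names. Whether the server then returns the real page depends entirely on the site.

```python
import requests

# Assumed example value; substitute the User-Agent of a browser you actually use.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0"
)


def make_session(user_agent=BROWSER_USER_AGENT):
    """Create a requests session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(url, session, timeout=10):
    """Fetch a page and return its HTML, raising on HTTP error status codes."""
    response = session.get(url, timeout=timeout)
    # A block or challenge response often carries a 4xx/5xx status,
    # so surface it here instead of parsing the wrong page.
    response.raise_for_status()
    return response.text
```

For example, html = fetch_html('https://www.example.com/', make_session()) would send the request with the custom header; the returned text can then be passed to BeautifulSoup as in the question.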

answered Jun 03 '20 at 20:04

Andrew

Can I use curl with Python? Can you show a small example? This question (https://stackoverflow.com/questions/25491090/how-to-use-python-to-execute-a-curl-command) shows how to do it using requests. – vega Jun 03 '20 at 20:07
I tested with the snippet `import pycurl; from io import BytesIO; bOBJ = BytesIO(); crl = pycurl.Curl(); crl.setopt(crl.URL, 'https://www.worldometers.info/coronavirus/'); crl.setopt(crl.WRITEDATA, bOBJ); crl.perform(); crl.close(); getBody = bOBJ.getvalue(); print(getBody)`, which returns the error `b'error code: 1010'`. – vega Jun 03 '20 at 20:38
I used 'Mozilla/5.0 (Windows; U; Windows NT 6.1; it; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729)' as the User-Agent and I still see only the Cloudflare reCAPTCHA page, so your advice is not working for me. – vega Jun 04 '20 at 08:09
