
I'm a beginner at web scraping and I'm trying to scrape this website. When I try to read the following td element, a piece of text is missing, even though it is there when I look at the page source in the browser.

Below is the markup returned by the BeautifulSoup parser. On the webpage, however, there is a string right after the script tag closes. I would like to scrape this string. How would I do that?

<td style="text-align:left; font-weight:bold;"><script type="text/javascript">document.write(Base64.decode(str_rot13("ZGL3Ywx5YwR1YwR2AN==")))</script></td>

Here is what is on the webpage:

<td style="text-align:left; font-weight:bold;"><script type="text/javascript">document.write(Base64.decode(str_rot13("ZGDjYwVjAF4lZwVhZj==")))</script>140.205.222.3</td>

My question is: why does this string appear in the webpage source but not in the BeautifulSoup output, and how would I go about obtaining it?

mayankmehtani
    Possible duplicate of [Wait page to load before getting data with requests.get in python 3](https://stackoverflow.com/questions/45448994/wait-page-to-load-before-getting-data-with-requests-get-in-python-3) – MT-FreeHK Aug 12 '18 at 05:33

1 Answer


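You don't see the text because BeautifulSoup doesn't run JavaScript; it only parses the HTML text it is given. The IP address is written into the page by the inline script when a browser executes it, so it is not in the HTML that BeautifulSoup receives. To get the rendered text you would need Selenium or another headless browser to execute the JavaScript on that page. However, this script is simple enough to emulate in Python: it applies ROT13 to the string and then Base64-decodes the result (the translation table is borrowed from "Short rot13 function - Python"):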

from bs4 import BeautifulSoup
import base64
import re

# Sample cell as it arrives from the server (the script has not been executed).
data = '''
<td style="text-align:left; font-weight:bold;">
    <script type="text/javascript">document.write(Base64.decode(str_rot13("ZGDjYwVjAF4lZwVhZj==")))</script>
</td>'''

# Translation table that swaps each letter with the one 13 places away.
rot13 = str.maketrans(
    "ABCDEFGHIJKLMabcdefghijklmNOPQRSTUVWXYZnopqrstuvwxyz",
    "NOPQRSTUVWXYZnopqrstuvwxyzABCDEFGHIJKLMabcdefghijklm")

soup = BeautifulSoup(data, 'lxml')
# Pull the quoted argument of str_rot13("...") out of the script tag.
encoded_string = re.search(r'str_rot13\("(.*?)"\)', str(soup.find('script')))[1]
# Undo ROT13 first, then Base64-decode the result.
decoded_string = base64.b64decode(encoded_string.translate(rot13)).decode('utf-8')

print(decoded_string)

This prints the decoded string:

140.205.222.3
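
If the page has many such cells, the same idea can be wrapped in a small helper. The following is only a sketch under a few assumptions: every obfuscated value uses the same `document.write(Base64.decode(str_rot13("...")))` pattern as the cells in the question, the names `ENCODED_RE`, `decode_obfuscated` and `sample_html` are made up for illustration, and the standard library's `rot_13` codec is swapped in for the hand-written translation table. It uses only the standard library and scans the raw HTML with a regular expression, so BeautifulSoup is not needed for this step.

```python
import base64
import codecs
import re

# Matches the quoted argument of str_rot13("...") in the inline scripts.
# Assumption: every obfuscated cell follows the pattern shown in the question.
ENCODED_RE = re.compile(r'str_rot13\("([^"]*)"\)')


def decode_obfuscated(html):
    """Return the decoded value of every str_rot13("...") argument in html."""
    results = []
    for encoded in ENCODED_RE.findall(html):
        # Undo ROT13 first (digits and '=' are left untouched), then Base64.
        b64_text = codecs.decode(encoded, "rot_13")
        results.append(base64.b64decode(b64_text).decode("utf-8"))
    return results


# The two encoded strings below are copied from the question.
sample_html = '''
<td><script type="text/javascript">document.write(Base64.decode(str_rot13("ZGL3Ywx5YwR1YwR2AN==")))</script></td>
<td><script type="text/javascript">document.write(Base64.decode(str_rot13("ZGDjYwVjAF4lZwVhZj==")))</script></td>
'''

print(decode_obfuscated(sample_html))
# ['167.99.15.164', '140.205.222.3']
```

In practice you would pass in the response text you downloaded instead of `sample_html`. If the site changes its obfuscation scheme, this approach stops working and a headless browser becomes the more robust option.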
Andrej Kesely
  • What's the reasoning behind storing the IP address in the form 'ZGL3Ywx5YwR1YwR2AN==' rather than just having the address? – mayankmehtani Aug 12 '18 at 17:32