Scrape alert text from window alert when alert is shown

Question

I am using the Python requests library and BeautifulSoup. For one URL, when the request is not valid, the server returns HTML containing an alert() pop-up. The problem is that with BeautifulSoup I cannot get the window.alert pop-up text.

I have tried using the regex method from this answer but it doesn't seem to work.

Thus when doing:

for script in soup.find_all("script"):
    alert = re.findall(r'(?<=alert\(\").+(?=\")', script.text)

The loop never reaches the script that calls alert().

This is the HTML that I am extracting from:

<script language="JavaScript">
if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>

<html>
<body>

</body>
</html>


<script>
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>

I am expecting to get the alert text, which is User ID. I notice that when the <html> tag is present I can't grab the script at the bottom. If I remove the tag, or move the script into the body tag, then I can get the script:

<script>
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>

Relevant: https://stackoverflow.com/questions/54948405/capture-javascript-alert-text-using-beautifulsoup — Joao Pereira, Jul 12 '19 at 16:47
@Fozoro it's different: because of how the HTML is written, the alert cannot be found. I tried the other answer on my own test and it works, but not with this HTML structure. — Salis, Jul 12 '19 at 17:13
In that answer, extract() is called on the script found with find(). Have you tried calling extract() on each script inside the loop? — Joao Pereira, Jul 12 '19 at 17:48
It's outside of the HTML tags, so it won't be in the soup. Examine the HTML and see if you can add a lookaround to isolate the right var. — QHarr, Jul 12 '19 at 18:57

score 1 · Accepted Answer · answered Jul 12 '19 at 19:15

This is solved by using the html5lib parser (installed separately, e.g. with pip install html5lib). According to the documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/, html5lib parses pages the same way a web browser does, so it is able to pick up the script that sits outside the body tag.

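import re
from bs4 import BeautifulSoup

# payload is assumed to be the HTML returned by requests (e.g. response.text)
soup = BeautifulSoup(payload, 'html5lib')
errors = None
for scr in soup.find_all("script"):
    # The page assigns the message with single quotes (var err='User ID'),
    # so accept either quote style; .string is None for empty script tags
    match = re.search(r"""err\s*=\s*(['"])(.*?)\1""", scr.string or "")
    if match:
        errors = match.group(2)

print(errors)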
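
If installing html5lib is not an option, a rough alternative is the standard library's html.parser.HTMLParser, which is the tokenizer that BeautifulSoup's 'html.parser' builder is based on. It reports tags in the order they appear and does not care whether a script sits inside or outside the html element. The sketch below is only an illustration: ScriptCollector and collect_scripts are hypothetical names, and the payload is a shortened copy of the page from the question.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the body of every <script> element, wherever it appears."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # finished script bodies, in document order
        self._in_script = False
        self._chunks = []        # text pieces of the script being read

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._chunks))
            self._in_script = False


def collect_scripts(html):
    """Return the text of all script elements found in html."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


# Shortened copy of the page from the question (an assumption for the demo)
payload = """<script language="JavaScript">
if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>
<html><body></body></html>
<script>
    var err='User ID';
    alert(err);
</script>"""

scripts = collect_scripts(payload)
print(len(scripts))                  # 2
print("alert(err)" in scripts[-1])   # True
```

Both scripts are collected, including the one after the closing html tag, so the same regex search can then be run over each entry in scripts.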

score 1 · Answer 2 · answered Jul 12 '19 at 20:40

When running BeautifulSoup's diagnose() over your data I obtain the following info:

data = '''
<script language="JavaScript">
if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>

<html>
<body>

</body>
</html>


<script>
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>'''

from bs4.diagnose import diagnose

diagnose(data)

Prints:

Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Found lxml version 4.3.3.0
Found html5lib version 1.0.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<script language="JavaScript">
 if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>
<html>
 <body>
 </body>
</html>
<script>
 var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
</script>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
  <script language="JavaScript">
   if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
  </script>
 </head>
 <body>
  <script>
   var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
  </script>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <head>
  <script language="JavaScript">
   if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
  </script>
 </head>
 <body>
 </body>
</html>

--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<script language="JavaScript">
 if(top.frames.length != 0) {
    location.href="frame_break.jsp"
}
</script>
--------------------------------------------------------------------------------

From this I can see that the lxml parser drops the last <script> (the one after the closing </html> tag), so you never reach it through BeautifulSoup. The solution is to use a different parser, e.g. html.parser:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')


for script in soup.select('script:contains(alert)'):
    alert = re.findall(r'(?<=alert\().+(?=\))', script.text)
    print(alert)

Prints:

['err']
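
Note that this prints the argument name err, not the message itself. One way to follow the variable back to its value is sketched below, assuming the message is assigned as a plain string literal somewhere in the same script. resolve_alert_text is a hypothetical helper name, script_text stands in for script.text from the loop above, and the regexes do not handle string concatenation or parentheses inside the message.

```python
import re


def resolve_alert_text(script_text):
    """Return the text passed to alert(), following one variable assignment."""
    call = re.search(r"alert\(\s*(.+?)\s*\)", script_text)
    if call is None:
        return None
    arg = call.group(1)

    # Case 1: alert("literal") or alert('literal')
    literal = re.fullmatch(r"""(['"])(.*)\1""", arg)
    if literal:
        return literal.group(2)

    # Case 2: alert(name), so look for name = '...' in the same script
    assign = re.search(
        r"\b" + re.escape(arg) + r"""\s*=\s*(['"])(.*?)\1""", script_text
    )
    return assign.group(2) if assign else None


# Body of the last script from the question
script_text = """
    var err='User ID';
    alert(err);
    iBankForm.action='login.jsp';
    iBankForm.submit();
"""

print(resolve_alert_text(script_text))  # User ID
```

For anything more involved than a single literal assignment, running the page in a real browser engine is likely to be more reliable than regexes.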
