Running BeautifulSoup's diagnose() over your data gives the following:
data = '''
<script language="JavaScript">
if(top.frames.length != 0) {
location.href="frame_break.jsp"
}
</script>
<html>
<body>
</body>
</html>
<script>
var err='User ID';
alert(err);
iBankForm.action='login.jsp';
iBankForm.submit();
</script>'''
from bs4.diagnose import diagnose
diagnose(data)
Prints:
Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Found lxml version 4.3.3.0
Found html5lib version 1.0.1
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<script language="JavaScript">
if(top.frames.length != 0) {
location.href="frame_break.jsp"
}
</script>
<html>
<body>
</body>
</html>
<script>
var err='User ID';
alert(err);
iBankForm.action='login.jsp';
iBankForm.submit();
</script>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
<head>
<script language="JavaScript">
if(top.frames.length != 0) {
location.href="frame_break.jsp"
}
</script>
</head>
<body>
<script>
var err='User ID';
alert(err);
iBankForm.action='login.jsp';
iBankForm.submit();
</script>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
<head>
<script language="JavaScript">
if(top.frames.length != 0) {
location.href="frame_break.jsp"
}
</script>
</head>
<body>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<script language="JavaScript">
if(top.frames.length != 0) {
location.href="frame_break.jsp"
}
</script>
--------------------------------------------------------------------------------
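The same comparison can be condensed into a count of the <script> tags each parser keeps. This is only a minimal sketch: the helper name count_scripts is hypothetical, and parsers that are not installed are reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

data = '''
<script language="JavaScript">
if(top.frames.length != 0) {
location.href="frame_break.jsp"
}
</script>
<html>
<body>
</body>
</html>
<script>
var err='User ID';
alert(err);
iBankForm.action='login.jsp';
iBankForm.submit();
</script>'''


def count_scripts(markup, parsers=("html.parser", "html5lib", "lxml")):
    """Return {parser name: number of <script> tags found, or None if unavailable}."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # The optional parser (lxml or html5lib) is not installed.
            counts[parser] = None
            continue
        counts[parser] = len(soup.find_all("script"))
    return counts


counts = count_scripts(data)
print(counts)
```

Going by the diagnose() output above, html.parser and html5lib should report two scripts and lxml only one, though the exact result may depend on the installed parser versions.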
From the diagnose() output you can see that the lxml parser drops the last <script> (html.parser and html5lib keep it), so you never reach it through BeautifulSoup. The solution is to use a different parser, e.g. html.parser:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
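# Note: newer soupsieve releases deprecate ':contains' in favor of ':-soup-contains'.
for script in soup.select('script:contains(alert)'):
    # Capture whatever sits between "alert(" and the closing ")".
    alert = re.findall(r'(?<=alert\().+(?=\))', script.text)
    print(alert)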
Prints:
['err']
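Note that the result is the name of the variable passed to alert(), not the message itself. If you need the message, one option is to look up the string assigned to that variable in the same script. The following is a rough sketch using only the re module; the helper name resolve_alert_message is hypothetical, and it assumes the argument is a plain variable declared with var and a quoted string literal, as in your data.

```python
import re

# Text of the second <script> from the question's markup.
script_text = """
var err='User ID';
alert(err);
iBankForm.action='login.jsp';
iBankForm.submit();
"""


def resolve_alert_message(js):
    """Return the string assigned to the variable passed to alert(), or None."""
    # Find the identifier used as the alert() argument, e.g. "err".
    call = re.search(r"alert\(\s*(\w+)\s*\)", js)
    if call is None:
        return None
    name = call.group(1)
    # Find "var <name> = '<text>'" (single or double quotes).
    assignment = re.search(
        r"\bvar\s+" + re.escape(name) + r"\s*=\s*(['\"])(.*?)\1", js
    )
    return assignment.group(2) if assignment else None


message = resolve_alert_message(script_text)
print(message)  # User ID
```

This is a regex heuristic, not a JavaScript parser, so it will miss anything more involved than a simple var assignment.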