I recently wrote a very basic regex (I'm new to regex) that finds all string literals in a JavaScript file:
import re

with open("file.js", "r", encoding="UTF-8") as file:
    print(re.findall(r"(\".+\"|\'.+\')", file.read()))
This worked perfectly on my own custom JS file. As soon as I tried it on scraped web pages, it stopped working.
The following HTML:
<html>
<body>
<script src="/modules.cb8e9af2c2709a34b49b.js"></script>
<script src="/watch.4c4d39803b119ef010a3.js"></script>
<script src="/common.acad5df36574c2182d15.js"></script>
<script src="/reward4823ace7ccd.js"></script>
<script src="/polyfills.2b2696c6c54a9388e1d4.js"></script>
<script src="/index.a5be217e620cedc065e5.js"></script>
</body>
</html>
would output
['/modules.cb8e9af2c2709a34b49b.js"></script><script src="/watch.4c4d39803b119ef010a3.js">
</script><script src="/common.acad5df36574c2182d15.js"></script><script src="/reward4823ace7ccd.js">
</script><script src="/polyfills.2b2696c6c54a9388e1d4.js"></script><script src="/index.a5be217e620cedc065e5.js']
when I accessed the real website code via
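import re, requests
from bs4 import BeautifulSoup
r = requests.get(link)  # link is the URL given at the end of this post
print(re.findall(r"(\".+\"|\'.+\')", str(BeautifulSoup(r.text, "html.parser"))))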
But when I saved the same HTML into my own custom file and ran the first snippet on it, it correctly output
['/modules.cb8e9af2c2709a34b49b.js', '/watch.4c4d39803b119ef010a3.js', '/common.acad5df36574c2182d15.js',
'/reward4823ace7ccd.js', '/polyfills.2b2696c6c54a9388e1d4.js', '/index.a5be217e620cedc065e5.js']
even though in both cases the input to the regex was a string. I already tried leaving the data unconverted and explicitly converting everything to strings, but the output was always the same.
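For what it's worth, the difference can be reproduced without any scraping. The sketch below is only a guess at what is going on: it assumes the scraped HTML arrives with the script tags on a single line, while my file has one tag per line (I have not confirmed this against the real site). The sample tags and the `LAZY` pattern are made up for illustration.

```python
import re

# Same pattern as in the question
PATTERN = r"(\".+\"|\'.+\')"
# Hypothetical non-greedy variant, for comparison only
LAZY = r"(\".+?\"|\'.+?\')"

# Made-up sample: one tag per line, like my custom file
multi_line = '<script src="/a.js"></script>\n<script src="/b.js"></script>'
# Same tags with the line break removed (assumed shape of the scraped HTML)
single_line = multi_line.replace("\n", "")

# "." does not match a newline by default, so each match stays on one line
per_line = re.findall(PATTERN, multi_line)

# On one line, the greedy ".+" runs from the first quote to the last quote
greedy = re.findall(PATTERN, single_line)

# The non-greedy ".+?" stops at the nearest closing quote
lazy = re.findall(LAZY, single_line)

print(per_line)
print(greedy)
print(lazy)
```

In this sketch the one-tag-per-line input yields two separate matches, the single-line input yields one long match, and the non-greedy variant yields two matches again.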
Why is that?
Also, if it helps, here is the (test) link I'm scraping (sure, the HTML is more complex, but that shouldn't change the regex's behaviour in this situation): "https://lolesports.com/schedule?leagues=european-masters,lcs,lck"