
I recently created a very basic regex (I'm new to regex) that detects all strings in a JavaScript file:

import re

with open("file.js", "r", encoding="UTF-8") as file:
    print(re.findall(r"(\".+\"|\'.+\')", file.read()))

This worked perfectly on my custom JS file. As soon as I started trying it out with web scraping, it stopped working.

The following HTML:

<html>
  <body>
    <script src="/modules.cb8e9af2c2709a34b49b.js"></script>
    <script src="/watch.4c4d39803b119ef010a3.js"></script>
    <script src="/common.acad5df36574c2182d15.js"></script>
    <script src="/reward4823ace7ccd.js"></script>
    <script src="/polyfills.2b2696c6c54a9388e1d4.js"></script>
    <script src="/index.a5be217e620cedc065e5.js"></script>
  </body>
</html>   

would output

['/modules.cb8e9af2c2709a34b49b.js"></script><script src="/watch.4c4d39803b119ef010a3.js">
</script><script src="/common.acad5df36574c2182d15.js"></script><script src="/reward4823ace7ccd.js"> 
</script><script src="/polyfills.2b2696c6c54a9388e1d4.js"></script><script src="/index.a5be217e620cedc065e5.js']

when I accessed the real website code via

r = requests.get(link)
re.findall(r"(\".+\"|\'.+\')", str(BeautifulSoup(r.text, "html.parser")))

But when I wrote the HTML into my custom file and tried it with the first code, it would correctly output

['/modules.cb8e9af2c2709a34b49b.js', '/watch.4c4d39803b119ef010a3.js', '/common.acad5df36574c2182d15.js', 
'/reward4823ace7ccd.js', '/polyfills.2b2696c6c54a9388e1d4.js', '/index.a5be217e620cedc065e5.js']

even though both times the data the regex had to read was a string. I already tried not converting anything, as well as explicitly converting everything to strings; the output was always the same.

Why is that?

Also, if it helps, here is the (test) link I'm scraping (sure, the HTML is more complex, but that shouldn't change the regex's behaviour in this situation): "https://lolesports.com/schedule?leagues=european-masters,lcs,lck"

thoerni
  • Your question takes it for granted that the value returned by `str(BeautifulSoup(r.text, "html.parser"))` is the same as the HTML file that you present. But that assumption is apparently not giving the expected results. That suggests that the two are not as similar as you think. – BoarGules Aug 15 '20 at 13:33
  • It wasn't, as I also said in my last paragraph. Still, I expected it to work, as the string of characters should've been the same, only with some text around it. As I found out, the formatting of the HTML was (and is) my problem (as I answered below). – thoerni Aug 15 '20 at 13:58
  • `+` is greedy; use the lazy quantifiers `+?` or `*?` instead. – Wiktor Stribiżew Aug 15 '20 at 14:01

1 Answer


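The regex "(\".+\"|\'.+\')" is greedy: .+ matches as many characters as possible (anything except a newline), so a match runs from the first " or ' on a line to the last one. In the following substring, for example, both "/modules.cb8e9af2c2709a34b49b.js" and "></script><script src=" end up inside the same match: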

<script src="/modules.cb8e9af2c2709a34b49b.js"></script><script src="...

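You should restrict the regex instead of using a bare .+, for example with a lazy quantifier (.+?) or a character class that excludes the quote character. Note that switching to re.finditer on its own does not change what the pattern matches.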
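As a rough illustration of such restrictions, the sketch below compares the original greedy pattern with a lazy variant and a negated character class. The HTML string is a shortened, one-line version of the markup from the question; the variable names are only illustrative.

```python
import re

# Two script tags on a single line, shortened from the question's HTML.
html = ('<script src="/modules.cb8e9af2c2709a34b49b.js"></script>'
        '<script src="/watch.4c4d39803b119ef010a3.js"></script>')

# Original pattern: .+ is greedy and runs to the last quote on the line.
greedy = re.findall(r"(\".+\"|\'.+\')", html)

# Lazy quantifier: .+? stops at the first closing quote it can reach.
lazy = re.findall(r"(\".+?\"|\'.+?\')", html)

# Negated character class: the match cannot cross a quote character.
negated = re.findall(r"(\"[^\"]+\"|\'[^\']+\')", html)

print(len(greedy))  # 1 (a single over-long match)
print(lazy)         # two separate quoted strings
print(negated)      # same result as the lazy variant here
```

Both variants require at least one character between the quotes, so empty strings such as "" are not handled; real JavaScript strings with escaped quotes would also need a more careful pattern.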

Kunal Kukreja
  • I actually found the problem (though I don't know a solution yet): the code kind of works, but only if the strings are on different lines. As long as they are on the same line, it won't work as intended, and the match will run from the first quote to the last one on that line. (This also explains why my example wasn't working: the website references all the scripts on one line, while I had formatted the code in the editor.) – thoerni Aug 15 '20 at 13:55
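The line-dependent behaviour described in this comment follows from the fact that, by default, . does not match a newline, so a greedy .+ can never extend past the end of a line. A minimal sketch, using hypothetical file names /a.js and /b.js rather than the real ones:

```python
import re

pattern = r"(\".+\"|\'.+\')"  # the pattern from the question

# Same markup, once on a single line and once split across two lines.
one_line = '<script src="/a.js"></script><script src="/b.js"></script>'
multi_line = '<script src="/a.js"></script>\n<script src="/b.js"></script>'

one_line_matches = re.findall(pattern, one_line)
multi_line_matches = re.findall(pattern, multi_line)

print(one_line_matches)    # one match spanning both tags
print(multi_line_matches)  # one match per line
```

Passing re.DOTALL would make . match newlines as well, in which case the multi-line input would show the same over-long match as the single-line one.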