Python: finding an embedded mp4 file with BeautifulSoup

Question

I am new to bs4!

I have looked through many tutorials but nothing works. I want to scrape the mp4 file from a site, but the embedded markup looks different from what the tutorials show. I have tried the find and find_all functions but can't get them to work. Can anyone help?

<div class="rmp-playlist-container">
<div class="rmp-playlist-player-wrapper">
<div id="rmpPlayer"></div>
</div>
</div>
<p><script>var playlistData = [{src: {mp4:["https://wantedurl.mp4"]},"contentMetadata": {"title": "video1",   "thumbnail":"https://somethumbnail.jpg","poster": [   "https://someposter.jpg"]}

current code:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'From': 'tikkanenfelix@gmail.com'  # This is another valid field
}

base_url = "url"

r = requests.get(base_url,headers=headers)

patt = re.compile(r'mp4:\s*\["(.+?)"\]')
soup = BeautifulSoup(r, 'html.parser')
print(soup)

for e in soup.find_all('script'):
    m = patt.search(e.string)
    if m:
        print(m.group(1))

Please include your code attempts – William Baker Morrison, Jan 26 '21 at 15:33

score 0 · Answer 1 · answered Jan 26 '21 at 15:42

You can try regular expressions to parse the JavaScript text.

from bs4 import BeautifulSoup
import re

patt = re.compile(r'mp4:\s*\["(.+?)"\]')

data = '''\
<div class="rmp-playlist-container">
<div class="rmp-playlist-player-wrapper">
<div id="rmpPlayer"></div>
</div>
</div>
<p><script>var playlistData = [{src: {mp4:["https://wantedurl.mp4"]},"contentMetadata": {"title": "video1",   "thumbnail":"https://somethumbnail.jpg","poster": [   "https://someposter.jpg"]}}];
</script>
'''

soup = BeautifulSoup(data, 'html.parser')

for e in soup.find_all('script'):
    m = patt.search(e.string)
    if m:
        print(m.group(1))

I get an error message that looks like this: `m = patt.search(e.string)` `TypeError: expected string or bytes-like object` – Felix T Jan 26 '21 at 16:05
What can I say without seeing your code? The above answer works on my machine. You must be doing something else. – Jan 26 '21 at 16:09
Here's a [demo](https://repl.it/@JustinEzequiel/bs4ViciousSystemsoftware). – Jan 26 '21 at 16:10
Do I have to output the source code of the site into a string for it to be able to work? – Felix T Jan 26 '21 at 16:12
No, you do not. So long as you load it into BeautifulSoup. Post your code, why don't you? – Jan 26 '21 at 16:14
Refusing to post any code makes it much more difficult to help you. – Jan 26 '21 at 16:14
Print out `r.content`. You may be getting different content than what your browser gets. Try adding a User-Agent header to your request. – Jan 26 '21 at 16:22
And if you are getting errors, do not paraphrase but paste the full traceback as part of your question. – Jan 26 '21 at 16:23
See https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python – Jan 26 '21 at 16:31
Still getting an error: `Traceback (most recent call last): File "C:/python/venv/videodownloader.py", line 15, in <module> soup = BeautifulSoup(r, 'html.parser') File "C:\python\venv\lib\site-packages\bs4\__init__.py", line 307, in __init__ elif len(markup) <= 256 and ( TypeError: object of type 'Response' has no len()` – Felix T Jan 26 '21 at 16:35
Tried with `r.text` as well and got this: `Traceback (most recent call last): File "C:/python/venv/videodownloader.py", line 19, in <module> m = patt.search(e.string) TypeError: expected string or bytes-like object` – Felix T Jan 26 '21 at 16:36
Easier if you can share the URL instead of hiding that information from us. Print out `r.status_code` and `r.reason`. Did I not ask you to print out `r.content` and check if you are getting the HTML you expect? – Jan 26 '21 at 16:38
`r.reason` says OK and `r.status_code` says 200. Thanks for your help :) I'll try and figure it out – Felix T Jan 26 '21 at 16:46
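
The two tracebacks in the comments point at two separate problems. `BeautifulSoup(r, ...)` fails because `r` is a `requests` Response object rather than markup, so `r.text` (or `r.content`) has to be passed instead. The second `TypeError` is consistent with `e.string` being `None`, which BeautifulSoup returns for a tag that has no child or more than one child, for example a `<script src="...">` tag with no inline code. The real page was never shared, so this is a likely cause rather than a confirmed one. A minimal sketch that skips such tags, where the function name `find_mp4_urls`, the variable `sample_html`, and the `example.com` URL are hypothetical and only for illustration:

```python
import re
from bs4 import BeautifulSoup

# Pattern from the answer above: captures the URL inside mp4:["..."]
MP4_PATTERN = re.compile(r'mp4:\s*\["(.+?)"\]')


def find_mp4_urls(html):
    """Return every mp4 URL found in inline <script> tags of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for script in soup.find_all("script"):
        # .string is None for e.g. <script src="..."></script>, so skip those
        if script.string is None:
            continue
        urls.extend(MP4_PATTERN.findall(script.string))
    return urls


# Hypothetical page: one external script (no text) and one inline script
sample_html = '''
<script src="https://example.com/player.js"></script>
<script>var playlistData = [{src: {mp4:["https://wantedurl.mp4"]}}];</script>
'''

print(find_mp4_urls(sample_html))
```

With the code from the question, the call would be `find_mp4_urls(r.text)` rather than passing `r` itself.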

score 0 · Accepted Answer · answered Jan 26 '21 at 15:42

You can use a regular expression to find all the links:

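import re
from bs4 import BeautifulSoup

text = """
<div class="rmp-playlist-container">
<div class="rmp-playlist-player-wrapper">
<div id="rmpPlayer"></div>
</div>
</div>
<p><script>var playlistData = [{src: {mp4:["https://wantedurl.mp4"]},"contentMetadata": {"title": "video1",   "thumbnail":"https://somethumbnail.jpg","poster": [   "https://someposter.jpg"]}}];
</script>
"""
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(text, 'html.parser')
# Non-greedy match with an escaped dot, so only URLs ending in ".mp4" are returned
print(re.findall(r'https[^"]*?\.mp4', soup.script.string))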
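
Because the URL sits inside JavaScript text rather than in an HTML attribute, the same regular-expression idea can also be applied to the raw page source without BeautifulSoup. This avoids any dependence on which `<script>` tag holds the data. The sketch below is only an illustration: the helper name `mp4_links` and the URL shape in the pattern (starts with `http://` or `https://`, contains no quotes or whitespace, ends in `.mp4`) are assumptions, and a URL with a query string after `.mp4` would be cut off at the extension.

```python
import re

# Assumed URL shape: starts with http(s)://, has no quotes or whitespace, ends in .mp4
MP4_URL = re.compile(r'https?://[^\s"\']+?\.mp4')


def mp4_links(page_text):
    """Return unique .mp4 URLs in page_text, in order of first appearance."""
    seen = []
    for url in MP4_URL.findall(page_text):
        if url not in seen:
            seen.append(url)
    return seen


# Shortened version of the snippet from the question
sample = 'var playlistData = [{src: {mp4:["https://wantedurl.mp4"]},"contentMetadata": {"thumbnail":"https://somethumbnail.jpg"}}];'
print(mp4_links(sample))
```

For a live page fetched with `requests`, the input would be `r.text`.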
