-1

I am new to bs4!

I have gone through many tutorials, but nothing works. I want to scrape the mp4 URL from a site, but the embedded markup looks different from what the tutorials show. I have tried the find and find_all functions but can't get them to work. Can anyone help?

<div class="rmp-playlist-container">
<div class="rmp-playlist-player-wrapper">
<div id="rmpPlayer"></div>
</div>
</div>
<p><script>var playlistData = [{src: {mp4:["https://wantedurl.mp4"]},"contentMetadata": {"title": "video1",   "thumbnail":"https://somethumbnail.jpg","poster": [   "https://someposter.jpg"]}

current code:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'From': 'tikkanenfelix@gmail.com'  # This is another valid field
}

base_url = "url"

r = requests.get(base_url,headers=headers)

patt = re.compile(r'mp4:\s*\["(.+?)"\]')
soup = BeautifulSoup(r, 'html.parser')
print(soup)

for e in soup.find_all('script'):
    m = patt.search(e.string)
    if m:
        print(m.group(1))


Felix T

2 Answers

0

You can use a regular expression to parse the JavaScript text inside the script tag.

from bs4 import BeautifulSoup
import re

patt = re.compile(r'mp4:\s*\["(.+?)"\]')

data = '''\
<div class="rmp-playlist-container">
<div class="rmp-playlist-player-wrapper">
<div id="rmpPlayer"></div>
</div>
</div>
<p><script>var playlistData = [{src: {mp4:["https://wantedurl.mp4"]},"contentMetadata": {"title": "video1",   "thumbnail":"https://somethumbnail.jpg","poster": [   "https://someposter.jpg"]}}];
</script>
'''

soup = BeautifulSoup(data, 'html.parser')

for e in soup.find_all('script'):
    m = patt.search(e.string)
    if m:
        print(m.group(1))

  • I get an error message on the line `m = patt.search(e.string)`: `TypeError: expected string or bytes-like object` – Felix T Jan 26 '21 at 16:05
  • What can I say without seeing your code? The above answer works on my machine. You must be doing something else. –  Jan 26 '21 at 16:09
  • Here's a [demo](https://repl.it/@JustinEzequiel/bs4ViciousSystemsoftware). –  Jan 26 '21 at 16:10
  • Do I have to output the source code of the site into a string for it to be able to work? – Felix T Jan 26 '21 at 16:12
  • No, you do not. So long as you load it into BeautifulSoup. Post your code, why don't you? –  Jan 26 '21 at 16:14
  • Refusing to post any code makes it much more difficult to help you. –  Jan 26 '21 at 16:14
  • Print out `r.content`. You may be getting different content than what your browser gets. Try adding a User-Agent header to your request. –  Jan 26 '21 at 16:22
  • And if you are getting errors, do not paraphrase but paste the full traceback as part of your question. –  Jan 26 '21 at 16:23
  • See https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python –  Jan 26 '21 at 16:31
  • Still getting an error: Traceback (most recent call last): File "C:/python/venv/videodownloader.py", line 15, in <module> soup = BeautifulSoup(r, 'html.parser') File "C:\python\venv\lib\site-packages\bs4\__init__.py", line 307, in __init__ elif len(markup) <= 256 and ( TypeError: object of type 'Response' has no len() – Felix T Jan 26 '21 at 16:35
  • Tried with r.text as well and got this: Traceback (most recent call last): File "C:/python/venv/videodownloader.py", line 19, in <module> m = patt.search(e.string) TypeError: expected string or bytes-like object – Felix T Jan 26 '21 at 16:36
  • Easier if you can share the URL instead of hiding that information from us. Print out `r.status_code` and `r.reason`. Did I not ask you to print out `r.content` and check if you are getting the HTML you expect? –  Jan 26 '21 at 16:38
  • r.reason says OK and r.status_code says 200. Thanks for your help :) I'll try to figure it out – Felix T Jan 26 '21 at 16:46
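
Both tracebacks in the comments above have likely causes that can be handled in a few lines. The first says BeautifulSoup was given the `Response` object itself, so passing `r.text` (or `r.content`) avoids it. The second is consistent with at least one `<script>` tag on the real page having no inline text (for example, one that only has a `src` attribute), in which case `.string` is `None` and `patt.search` rejects it. The following is a minimal sketch under those assumptions; the helper name `extract_mp4_urls` and the sample markup are hypothetical and not taken from the actual site.

```python
import re
from bs4 import BeautifulSoup

# Same pattern as in the answer above: captures the first URL in an mp4:[...] array.
MP4_PATTERN = re.compile(r'mp4:\s*\["(.+?)"\]')


def extract_mp4_urls(html):
    """Return the mp4 URLs found in inline <script> tags of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for script in soup.find_all("script"):
        text = script.string
        # .string is None for tags without a single text child,
        # e.g. <script src="..."></script>, so skip those.
        if text is None:
            continue
        match = MP4_PATTERN.search(text)
        if match:
            urls.append(match.group(1))
    return urls


# Hypothetical sample: one external script (no inline text) and one inline script.
sample = '''<script src="https://example.com/player.js"></script>
<script>var playlistData = [{src: {mp4:["https://wantedurl.mp4"]}}];</script>'''

print(extract_mp4_urls(sample))
```

With a live request, the call would be `extract_mp4_urls(r.text)` rather than passing `r` directly. If the list comes back empty, printing `r.text` shows whether the server returned the same markup the browser sees.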
0

You can use a regular expression to find all the links:

import re
from bs4 import BeautifulSoup  # needed for the BeautifulSoup call below
text = """
<div class="rmp-playlist-container">
<div class="rmp-playlist-player-wrapper">
<div id="rmpPlayer"></div>
</div>
</div>
<p><script>var playlistData = [{src: {mp4:["https://wantedurl.mp4"]},"contentMetadata": {"title": "video1",   "thumbnail":"https://somethumbnail.jpg","poster": [   "https://someposter.jpg"]}
"""
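# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(text, 'html.parser')
# Escape the dot before "mp4" and stop at a closing quote so each match is a single URL.
re.findall(r'https?://[^"]+?\.mp4', soup.script.string)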
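
One thing to watch with a pattern like `https.*.mp4` is that `.*` is greedy: if the script text contains more than one URL, a single match can run from the first `https` to the last `mp4`. The sketch below compares that pattern with a stricter one on a made-up string; the `a.example` URLs are purely illustrative and do not come from the question.

```python
import re

# Hypothetical script text containing two mp4 URLs and one unrelated URL.
script_text = (
    'mp4:["https://a.example/one.mp4"], '
    'poster: "https://a.example/p.jpg", '
    'alt: ["https://a.example/two.mp4"]'
)

# Greedy: ".*" extends as far as possible, so everything from the first
# "https" to the last "mp4" ends up in one match.
greedy = re.findall(r"https.*.mp4", script_text)

# Stricter: never cross a quote or whitespace, and require a literal ".mp4".
strict = re.findall(r'https?://[^"\s]+?\.mp4', script_text)

print(len(greedy))
print(strict)
```

The stricter pattern still assumes the URLs are wrapped in double quotes and end in `.mp4`; URLs with query strings after the extension would need a different pattern.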
Epsi95