Python find all in text with regex

Question

I'm trying to parse a website and get 4 URLs for video files. Example link: https://cs510400.vk.me/3/u381845574/videos/e8f1419d5b.720.mp4

First I grab the HTML and find the tag which contains my links. Then I try to find the line with my links.

My code:

# coding: utf-8
import requests
from bs4 import BeautifulSoup

import re


r = requests.get('https://vk.com/video-63758929_456249306')

soup = BeautifulSoup(r.content,'lxml')
scripts = soup.find_all('script')
current_tag = scripts[-1].string




links = re.findall('^.*source.*$',current_tag,re.MULTILINE)
current_line = []
for x in links:
    current_line.append(x)

print(current_line)

I got this result:

[u'ajax.preload(\'al_video.php\', {"act":"show","video":"-63758929_456249306","module":"direct"}, ["\u041d\u0435\u043c\u043d\u043e\u0433\u043e \u043f\u043e\u0442\u0430\u0441\u043a\u0443\u0445\u0430","<div id=\\"video_box_wrap-63758929_456249306\\" class=\\"video_box_wrap\\">\\n  <video id=\\"video_player\\" poster=\\"https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg\\" preload=\\"none\\" controls  onplaying=\\"cur.incViews && cur.incViews()\\">\\n    <source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.720.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.480.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.360.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.240.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source>\\n    <div class=\\"video_box_background\\" style=\\"background-image:url(https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg);\\"><\\/div>\\n    <div class=\\"video_box_cant_play\\">\u0414\u0430\u043d\u043d\u043e\u0435 \u0432\u0438\u0434\u0435\u043e \u043d\u0435 \u043c\u043e\u0436\u0435\u0442 \u0431\u044b\u0442\u044c \u043f\u0440\u043e\u0438\u0433\u0440\u0430\u043d\u043e \u043d\u0430 \u044d\u0442\u043e\u043c 
\u0443\u0441\u0442\u0440\u043e\u0439\u0441\u0442\u0432\u0435<\\/div>\\n  <\\/video>\\n<\\/div>","\\naddTemplates({\\"_\\":\\"_\\",\\"audio_row\\":\\"<div class=\\\\\\"audio_row _audio_row _audio_row_%1%_%0% %cls% clear_fix\\\\\\" onclick=\\\\\\"return getAudioPlayer().toggleAudio(this, event)\\\\\\" data-audio=\\\\\\"%serialized%\\\\\\" data-full-id=\\\\\\"%1%_%0%\\\\\\" id=\\\\\\"audio_%1%_%0%\\\\\\">\\\\n  <div class=\\\\\\"audio_play_wrap\\\\\\" data-nodrag=\\\\\\"1\\\\\\"><button class=\\\\\\"audio_play _audio_play\\\\\\" id=\\\\\\"play_%1%_%0%\\\\\\" aria-label=\\\\\\"\\\\\\"><\\\\\\/button><\\\\\\/div>\\\\n  <div class=\\\\\\"audio_info\\\\\\">\\\\n    <div class=\\\\\\"audio_duration_wrap _audio_duration_wrap\\\\\\">\\\\n      <div class=\\\\\\"audio_hq_label\\\\\\"><\\\\\\/div>\\\\n      <div class=\\\\\\"audio_duration _audio_duration\\\\\\">%duration%<\\\\\\/div>\\\\n      <div class=\\\\\\"audio_acts\\\\\\">\\\\n        <div class=\\\\\\"audio_act\\\\\\" id=\\\\\\"recom\\\\\\" onmouseover=\\\\\\"audioShowActionTooltip(this, \'%1%_%0%\')\\\\\\" onclick=\\\\\\"AudioPage(this).showRecoms(this, \'%1%_%0%\', event)\\\\\\"><div><\\\\\\/div><\\\\\\/d
...

But I need only my 4 links. What am I doing wrong? How can I get only the links from this large tag?

This answer might be useful: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup You can use SoupStrainer to get the relevant elements in the page — smernst, Nov 07 '16 at 10:58
`for link in BeautifulSoup(r.content,'html.parser',parseOnlyThese=SoupStrainer('script')): if link.has_attr('source'): print link['source']` doesn't work for me. It can only find the src of script tags, not attributes inside the hardcoded script. — Konstantin Rusanov, Nov 07 '16 at 11:14

Ibrahim · Accepted Answer · 2016-11-08T08:47:47.783

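Your pattern `^.*source.*$` matches every whole line that contains "source", which is why you get the entire script line back instead of the links. I included your results as a string and added a regex that extracts only the URLs.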

Regex:

(?<=src\=\\\")(https:\\\/\\\/c[\s\S]*?mp4)

Regex Demo: https://regex101.com/r/GDMBqH/2

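When using the regex in Python, there is no need to escape the forward slash: regex101 requires `\/` only because `/` is its pattern delimiter. In a raw string (r"..."), `\\` in the pattern matches one literal backslash in the text.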
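As a minimal sketch of how the lookbehind version of the pattern could be written in Python: the sample string below is a shortened copy of one source tag from the question (the extra parameter is elided), and the variable names are only illustrative.

```python
import re

# Shortened sample in the escaped form shown in the question; the raw string
# keeps each backslash exactly as it appears in the script text.
sample = r'<source src=\"https:\/\/cs510400.vk.me\/3\/u381845574\/videos\/e8f1419d5b.720.mp4?extra=...\" type=\"video\/mp4\">'

# In the raw pattern, \\ matches one literal backslash; / and = need no escaping.
# The lookbehind requires src=\" directly before the URL without capturing it.
pattern = r'(?<=src=\\")(https:\\/\\/c[\s\S]*?mp4)'

matches = re.findall(pattern, sample)
print(matches)
```

Because the pattern has a single capturing group, re.findall returns a list of the captured URLs, still with the escaped slashes.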

Python Code:

import re
results = '''[u'ajax.preload(\'al_video.php\', {"act":"show","video":"-63758929_456249306","module":"direct"}, ["\u041d\u0435\u043c\u043d\u043e\u0433\u043e \u043f\u043e\u0442\u0430\u0441\u043a\u0443\u0445\u0430","<div id=\\"video_box_wrap-63758929_456249306\\" class=\\"video_box_wrap\\">\\n  <video id=\\"video_player\\" poster=\\"https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg\\" preload=\\"none\\" controls  onplaying=\\"cur.incViews && cur.incViews()\\">\\n    <source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.720.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.480.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.360.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.240.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source>\\n    <div class=\\"video_box_background\\" style=\\"background-image:url(https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg);\\"><\\/div>\\n    <div class=\\"video_box_cant_play\\">\u0414\u0430\u043d\u043d\u043e\u0435 \u0432\u0438\u0434\u0435\u043e \u043d\u0435 \u043c\u043e\u0436\u0435\u0442 \u0431\u044b\u0442\u044c \u043f\u0440\u043e\u0438\u0433\u0440\u0430\u043d\u043e \u043d\u0430 \u044d\u0442\u043e\u043c 
\u0443\u0441\u0442\u0440\u043e\u0439\u0441\u0442\u0432\u0435<\\/div>\\n  <\\/video>\\n<\\/div>","\\naddTemplates({\\"_\\":\\"_\\",\\"audio_row\\":\\"<div class=\\\\\\"audio_row _audio_row _audio_row_%1%_%0% %cls% clear_fix\\\\\\" onclick=\\\\\\"return getAudioPlayer().toggleAudio(this, event)\\\\\\" data-audio=\\\\\\"%serialized%\\\\\\" data-full-id=\\\\\\"%1%_%0%\\\\\\" id=\\\\\\"audio_%1%_%0%\\\\\\">\\\\n  <div class=\\\\\\"audio_play_wrap\\\\\\" data-nodrag=\\\\\\"1\\\\\\"><button class=\\\\\\"audio_play _audio_play\\\\\\" id=\\\\\\"play_%1%_%0%\\\\\\" aria-label=\\\\\\"\\\\\\"><\\\\\\/button><\\\\\\/div>\\\\n  <div class=\\\\\\"audio_info\\\\\\">\\\\n    <div class=\\\\\\"audio_duration_wrap _audio_duration_wrap\\\\\\">\\\\n      <div class=\\\\\\"audio_hq_label\\\\\\"><\\\\\\/div>\\\\n      <div class=\\\\\\"audio_duration _audio_duration\\\\\\">%duration%<\\\\\\/div>\\\\n      <div class=\\\\\\"audio_acts\\\\\\">\\\\n        <div class=\\\\\\"audio_act\\\\\\" id=\\\\\\"recom\\\\\\" onmouseover=\\\\\\"audioShowActionTooltip(this, \'%1%_%0%\')\\\\\\" onclick=\\\\\\"AudioPage(this).showRecoms(this, \'%1%_%0%\', event)\\\\\\"><div><\\\\\\/div><\\\\\\/d'''
# Raw string: \\ matches the literal backslash that precedes each / in the text.
for m in re.finditer(r"(https:\\/\\/c[\s\S]*?mp4)", results):
    print(m.group(0))

Demo https://repl.it/EQkR/1
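The matches still contain the escaped slashes (\/) from the script text. A possible follow-up step, sketched here under the assumption that plain URLs are wanted, is to replace each \/ with /. The helper name extract_video_urls is hypothetical, and the sample is a shortened copy of two source tags from the question with the extra parameter elided.

```python
import re

# Same pattern as in the answer: a literal backslash before each slash.
VIDEO_URL = re.compile(r'https:\\/\\/c[\s\S]*?mp4')


def extract_video_urls(text):
    """Return the matched URLs with each escaped slash turned back into '/'."""
    return [m.group(0).replace('\\/', '/') for m in VIDEO_URL.finditer(text)]


# Shortened sample in the same escaped form as the script content.
sample = (
    r'<source src=\"https:\/\/cs510400.vk.me\/3\/u381845574\/videos\/e8f1419d5b.720.mp4?extra=...\" type=\"video\/mp4\">'
    r'<source src=\"https:\/\/cs510400.vk.me\/3\/u381845574\/videos\/e8f1419d5b.480.mp4?extra=...\" type=\"video\/mp4\">'
)

for url in extract_video_urls(sample):
    print(url)
```

Note that the pattern stops at mp4, so the query string after it is not part of the result.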

@KonstantinRusanov Try this for the regex: `(?<=src\=\\\")(https:\\\/\\\/c[\s\S]*?mp4)` Check out the results here: https://regex101.com/r/GDMBqH/2 — Ibrahim, Nov 08 '16 at 00:22
As for Python, the online Python website doesn't seem to support `BeautifulSoup`. — Ibrahim, Nov 08 '16 at 00:39
