how to get only text after between match groups with regex from html youtube page

Question

I am using Screaming Frog to scrape YouTube video keywords. I know the software has a tab that captures exactly that meta info, but it only shows 160 characters, so videos with a longer list of keywords are cut off there.

I also tried CSS selectors and XPath through the software's custom extraction feature, but did not get anything.

The last thing I can think of is using a regex in the custom extraction to capture the keywords straight from the HTML page.

This is the part where the keywords appear:

    <meta property="og:video:tag" content="lanshow">
    <meta property="og:video:tag" content="lanshow ep04">
    <meta property="og:video:tag" content="lanshow episodio 4">
    <meta property="og:video:tag" content="lanshow 4">
    <meta property="og:video:tag" content="directo unboxme">
    <meta property="og:video:tag" content="directo tecnologia">
    <meta property="og:video:tag" content="directo hardware">
    <meta property="og:video:tag" content="directo preguntas y respuestas">
    <meta property="og:video:tag" content="preguntas y respuestas unboxme">

They also appear enumerated one after another further down like so:

"keywords":"lanshow,lanshow ep04,lanshow episodio 4,lanshow 4,directo unboxme,directo tecnologia,directo hardware,directo preguntas y respuestas,preguntas y respuestas unboxme","c":"WEB","player_response":"{\"videoDetails\":{\"thumbnail\":{\"thumbnails...

Is there a way to capture only the keywords, using regex, capture groups or something of the sort?

I have tried different regex combinations, but the extraction returns the whole matched text, and in some cases the entire rest of the HTML.

This gets only the first keyword: video:tag"content=.*?>

I also tried another regex that extracted the whole html text after the first keyword. I need to find a way to tell the extractor to find the before and after delimiters and ignore them on the extraction to get only what is in between (the actual keywords).

This is the before delimiter:

This is the after delimiter: ">

Is there a way to do that?

Thank you.

Update your question with everything you've tried – Andersson, Jun 05 '17 at 06:43

score 0 · Answer 1 · answered Jun 14 '17 at 23:13

The XPath expression //meta[@property='og:video:tag']/attribute::content should get you all the relevant info.

Here's a Python snippet, as I'm not familiar with Screaming Frog:

import lxml.html

# Parse a locally saved copy of the video page
doc = lxml.html.parse('yt.html')
# attribute::content is the long form of @content
meta_tags = doc.xpath("//meta[@property='og:video:tag']/attribute::content")
for content in meta_tags:
    print(content)
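
If lxml is not installed, the same parser-based idea can be sketched with the standard library's html.parser instead of XPath. This is only an illustration: the class name VideoTagCollector and the helper extract_video_tags are made up for this example, and the sample markup is a shortened copy of the snippet in the question.

```python
from html.parser import HTMLParser

# Sample markup copied (shortened) from the question
SAMPLE = '''
<meta property="og:video:tag" content="lanshow">
<meta property="og:video:tag" content="lanshow ep04">
<meta property="og:video:tag" content="directo hardware">
'''


class VideoTagCollector(HTMLParser):
    """Hypothetical helper: collects the content attribute of every
    <meta property="og:video:tag"> element it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:video:tag":
            self.tags.append(attrs.get("content", ""))


def extract_video_tags(html_text):
    """Return the og:video:tag values found in html_text, in page order."""
    collector = VideoTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.tags


print(extract_video_tags(SAMPLE))
```

Because the parser reads attributes by name, this does not depend on attribute order or spacing inside the tag, which is the main weakness of a regex.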

Alternatively (parsing HTML with regex is fragile and might lead to unwanted results), the simplest regular expression I could think of matches the HTML meta tags and captures the value of the content attribute in a group. The specifics, such as escaping of special characters and flags, may differ in your programming language or tool of choice, but this should work in many:

<meta property="og:video:tag" content="(.+?)">

In a Python script:

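import re
import requests

# The capture group (.+?) keeps only the text between the delimiters
match_metas = re.compile('<meta property="og:video:tag" content="(.+?)">')
result = requests.get('https://www.youtube.com/watch?v=HHMdrAhVbLo')

# result.text is the decoded page; findall returns just the captured groups
print(match_metas.findall(result.text))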

Result:

['unboxing en directo', 'unboxing mionix color', 'rx 580', 'talk show', 'lanshow', 'lanshow ep04', 'lanshow episodio 4', 'lanshow 4', 'directo unboxme', 'directo tecnologia', 'directo hardware', 'directo preguntas y respuestas', 'preguntas y respuestas unboxme']
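
The question also mentions that the keywords appear as one comma-separated "keywords":"..." string further down the page. A capture group could pull that string out as well. The following is a minimal sketch, assuming the value contains no escaped double quotes; the function name extract_keywords is made up for this example, and the input is a shortened copy of the fragment quoted in the question.

```python
import re

# Fragment copied (shortened) from the question; the real page holds far more data
PAGE = ('"keywords":"lanshow,lanshow ep04,lanshow episodio 4,lanshow 4",'
        '"c":"WEB"')

# Delimiters sit outside the parentheses, so the group holds only the keywords.
# Assumption: the keyword string itself contains no double quotes.
KEYWORDS_RE = re.compile(r'"keywords":"([^"]*)"')


def extract_keywords(page_text):
    """Return the keywords as a list, or an empty list if none are found."""
    match = KEYWORDS_RE.search(page_text)
    if match is None:
        return []
    # Split the captured string on commas and drop empty entries
    return [kw.strip() for kw in match.group(1).split(",") if kw.strip()]


print(extract_keywords(PAGE))
```

Using [^"]* rather than .* stops the match at the first closing quote, which avoids pulling in the rest of the page.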
