
hi I have a string as http://www.yifysubtitles.com/subtitles/blockers2018720pwebripx264-ytsam-arabic-128849"><span class="text-muted">subtitle</span> Blockers.2018.720p.WEBRip.x264-[YTS.AM]</a></td><td class="other-cell"></td><td class="uploader-cell"><a href="/user/SHINAWY">SHINAWY</a></td><td class="download-cell"><a href="/subtitles/blockers-arabic-yify-128849" class="subtitle-download" >download</a></td></tr><tr data-id="128835"><td class="rating-cell"><span class="label">0</span></td><td class="flag-cell"><span class="flag flag-cn"></span><span class="sub-lang">Chinese</span></td><td><a href="/subtitles/blockers2018720pblurayx264-ytsmecht-chinese-128835"><span class="text-muted">subtitle</span> Blockers.2018.720p.BluRay.x264-[YTS.ME].cht </a></td><td class="other-cell"></td><td class="uploader-cell"><a href="/user/osamawang">osamawang</a></td><td class="download-cell"><a href="/subtitles/blockers-chinese-yify-128835" class="subtitle-download" >download</a></td></tr><tr data-id="128543" class="high-rating"><td class="rating-cell"><span class="label label-success">6</span></td><td class="flag-cell"><span class="flag flag-gb"></span><span class="sub-lang">English</span></td><td><a href="/subtitles/blockers2018web-dlx264-fgt-english-128543"><span class="text-muted">subtitle</span> Blockers.2018.WEB-DL.x264-FGT</a></td><td class="other-cell"></td><td class="uploader-cell"><a href="/user/sub">sub</a></td><td class="download-cell"><a href="/subtitles/blockers-english-yify-128543" class="subtitle-download" >download</a></td></tr><tr data-id="128633"><td class="rating-cell"><span class="label">0</span></td><td class="flag-cell"><span class="flag flag-rs"></span><span class="sub-lang">Serbian</span></td><td><a href="/subtitles/blockers2018720pblurayx264ytsag-serbian-128633"><span class="text-muted">subtitle</span> Blockers.2018.720p.BluRay.x264.[YTS.AG]</a></td><td class="other-cell"></td><td class="uploader-cell"><a href="/user/TesneGace">TesneGace</a></td><td 
class="download-cell"><a href="/subtitles/blockers-serbian-yify-128633" class="subtitle-download" >download</a></td></tr><tr data-id="128702"><td class="rating-cell"><span class="label label-success">2</span></td><td class="flag-cell"><span class="flag flag-es"></span><span class="sub-lang">Spanish</span></td><td><a href="/subtitles/blockers2018720pblurayx264ytsag-spanish-128702"><span class="text-muted">subtitle</span> Blockers.2018.720p.BluRay.x264.[YTS.AG]</a></td><td class="other-cell"></td><td class="uploader-cell"><a href="/subtitles/blockers-english-yify-128543

and I am trying to match the first occurrence of english-yify, i.e. "/subtitles/blockers-english-yify-128543"

my pattern is re.search(r'/subtitles/.+\-english\-yify-\d+',text)

but my code returns the entire string. Please help.

My regex is available here

Pyd

1 Answer


Your string is in fact HTML, so you should use an HTML parser instead. I suggest the excellent lxml.html parser.

To answer your question: regex quantifiers are greedy by default, which means your `.+` part will grab as many characters as it can while still letting the rest of the pattern match. So the match runs from the first /subtitles/ to the last -english-yify- and includes everything in between.
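As a rough illustration of the greedy behaviour described above, here is a minimal sketch using a shortened, made-up stand-in for the HTML in the question (the `text` value is an assumption, not the original string). It also shows why switching to a lazy `.+?` alone does not help: a regex match always begins at the leftmost possible position, so it still starts at the first /subtitles/. Restricting the characters the middle part may consume, for example to anything except whitespace and `"`, keeps the match inside a single attribute value.

```python
import re

# Shortened, hypothetical stand-in for the HTML in the question.
text = (
    '<a href="/subtitles/blockers-arabic-yify-128849">download</a>'
    '<a href="/subtitles/blockers-english-yify-128543">download</a>'
    '<a href="/subtitles/blockers-spanish-yify-128702">download</a>'
    '<a href="/subtitles/blockers-english-yify-128543'
)

# Greedy: starts at the first /subtitles/ and runs to the LAST english link.
greedy = re.search(r'/subtitles/.+-english-yify-\d+', text).group()

# Lazy: stops at the FIRST english link, but still starts at the first
# /subtitles/ (the Arabic one), so it spans two links.
lazy = re.search(r'/subtitles/.+?-english-yify-\d+', text).group()

# Bounded: the middle part may not cross whitespace or a double quote,
# so the match cannot leave the href value it started in.
bounded = re.search(r'/subtitles/[^\s"]+-english-yify-\d+', text).group()

print(len(greedy), len(lazy))
print(bounded)
```

With this sample, only the third pattern yields just the English link; the first two both begin at the Arabic one.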

nosklo
  • Use `.+?` instead for a non-greedy quantifier. https://docs.python.org/3/library/re.html#regular-expression-syntax – Håken Lid Jun 27 '18 at 15:58
  • I tried that, but it is not working. Please check my regex: https://regex101.com/r/kyzg1J/4 – Pyd Jun 27 '18 at 15:59
  • Use something like `\w` instead of `.`, to avoid matching spaces and `"` etc. https://regex101.com/r/LL4zAq/2 – Håken Lid Jun 27 '18 at 16:06
  • This string is similar to that one but does not match: https://regex101.com/r/LL4zAq/3 (same regex, different movie name) – Pyd Jun 27 '18 at 16:25
  • @HåkenLid can you take a look? – Pyd Jun 27 '18 at 16:42
  • A dash `-` is not considered a "word character" and not captured by `\w`. You can use `[^\s\"]` instead. https://regex101.com/r/LL4zAq/4 – Håken Lid Jun 27 '18 at 16:48
  • Be advised that it's quite possible to come across valid HTML that can't be parsed by a regular expression. For example, if there are HTML comments, if there are URLs that are not wrapped in `"` quotes, if the URLs contain unexpected characters, etc. Libraries such as BeautifulSoup and Scrapy can handle all sorts of input data, including malformed HTML. See this answer for more details: https://stackoverflow.com/a/1732454/1977847 – Håken Lid Jun 27 '18 at 16:59
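
For reference, the parser-based approach recommended in the answer and in the last comment could look roughly like the sketch below. It deliberately uses the standard library's `html.parser` rather than lxml.html, BeautifulSoup, or Scrapy, so that it runs without extra installs. The class name `SubtitleLinkParser`, the helper `first_english_link`, and the `sample` markup are hypothetical names and data made up for illustration, and filtering on the `subtitle-download` class is an assumption based on the markup shown in the question.

```python
from html.parser import HTMLParser


class SubtitleLinkParser(HTMLParser):
    """Collect href values of <a> tags carrying the subtitle-download class."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # A valueless attribute is reported as None, hence the `or ""`.
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href")
        if "subtitle-download" in classes and href:
            self.links.append(href)


def first_english_link(html):
    """Return the first download link containing -english-yify-, or None."""
    parser = SubtitleLinkParser()
    parser.feed(html)
    for href in parser.links:
        if "-english-yify-" in href:
            return href
    return None


# Hypothetical sample modelled on the markup in the question.
sample = (
    '<table><tr><td class="download-cell">'
    '<a href="/subtitles/blockers-arabic-yify-128849" '
    'class="subtitle-download" >download</a></td></tr>'
    '<tr><td class="download-cell">'
    '<a href="/subtitles/blockers-english-yify-128543" '
    'class="subtitle-download" >download</a></td></tr></table>'
)

print(first_english_link(sample))
```

Because the parser hands over each attribute value separately, the check on `href` never has to worry about where one link ends and the next begins, which is the problem the regex variants above keep running into.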