Regex matches too much instead of only the exact substring in Python

Question

hi I have a string as http://www.yifysubtitles.com/subtitles/blockers2018720pwebripx264-ytsam-arabic-128849">subtitle Blockers.2018.720p.WEBRip.x264-[YTS.AM]</a></td><td class="other-cell"></td><td class="uploader-cell"><a href="/user/SHINAWY">SHINAWY</a></td><td class="download-cell"><a href="/subtitles/blockers-arabic-yify-128849" class="subtitle-download" >download</a></td></tr><tr data-id="128835"><td class="rating-cell">0</td><td class="flag-cell">Chinese</td><td><a href="/subtitles/blockers2018720pblurayx264-ytsmecht-chinese-128835">subtitle Blockers.2018.720p.BluRay.x264-[YTS.ME].cht </a></td><td class="other-cell"></td><td class="uploader-cell"><a href="/user/osamawang">osamawang</a></td><td class="download-cell"><a href="/subtitles/blockers-chinese-yify-128835" class="subtitle-download" >download</a></td></tr><tr data-id="128543" class="high-rating"><td class="rating-cell">6</td><td class="flag-cell">English</td><td><a href="/subtitles/blockers2018web-dlx264-fgt-english-128543">subtitle Blockers.2018.WEB-DL.x264-FGT</a></td><td class="other-cell"></td><td class="uploader-cell"><a href="/user/sub">sub</a></td><td class="download-cell"><a href="/subtitles/blockers-english-yify-128543" class="subtitle-download" >download</a></td></tr><tr data-id="128633"><td class="rating-cell">0</td><td class="flag-cell">Serbian</td><td><a href="/subtitles/blockers2018720pblurayx264ytsag-serbian-128633">subtitle Blockers.2018.720p.BluRay.x264.[YTS.AG]</a></td><td class="other-cell"></td><td class="uploader-cell"><a href="/user/TesneGace">TesneGace</a></td><td class="download-cell"><a href="/subtitles/blockers-serbian-yify-128633" class="subtitle-download" >download</a></td></tr><tr data-id="128702"><td class="rating-cell">2</td><td class="flag-cell">Spanish</td><td><a href="/subtitles/blockers2018720pblurayx264ytsag-spanish-128702">subtitle Blockers.2018.720p.BluRay.x264.[YTS.AG]</a></td><td class="other-cell"></td><td class="uploader-cell"><a 
href="/subtitles/blockers-english-yify-128543

and I am trying to match the first occurrence of the english-yify link, "/subtitles/blockers-english-yify-128543".

My pattern is re.search(r'/subtitles/.+\-english\-yify-\d+', text)

but my code returns the entire string. Please help.

My regex is available here.

score -1 · Accepted Answer · answered Jun 27 '18 at 15:53

Your string is in fact HTML, so you should use an HTML parser instead of a regex. I suggest the excellent lxml.html parser.
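
A minimal sketch of the parser approach, assuming the goal is the first href that contains -english-yify-. It uses the standard library's html.parser as a stand-in for lxml.html so that it runs without extra installs; the HrefCollector class and the shortened sample markup are illustrative assumptions, not taken from the real page.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Shortened sample modelled on the rows in the question.
sample = (
    '<table><tr><td><a href="/subtitles/blockers2018720pwebripx264-ytsam-arabic-128849">'
    'subtitle</a></td><td class="download-cell">'
    '<a href="/subtitles/blockers-arabic-yify-128849" class="subtitle-download" >download</a>'
    '</td></tr><tr><td><a href="/subtitles/blockers2018web-dlx264-fgt-english-128543">'
    'subtitle</a></td><td class="download-cell">'
    '<a href="/subtitles/blockers-english-yify-128543" class="subtitle-download" >download</a>'
    '</td></tr></table>'
)

collector = HrefCollector()
collector.feed(sample)
collector.close()

# Plain substring test on each extracted link; no regex over the raw HTML.
english = [h for h in collector.hrefs if "-english-yify-" in h]
first_english = english[0] if english else None
print(first_english)  # /subtitles/blockers-english-yify-128543
```

Because the parser hands back each attribute value separately, the filter only ever sees one link at a time and cannot run across tag boundaries.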

To answer your question: regex quantifiers are greedy by default, which means your .+ part will grab as many characters as it can while still letting the rest of the pattern match. So the match runs from the first /subtitles/ to the last -english-yify-, including everything in between.
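
A small sketch of that behaviour, assuming a shortened sample string with two occurrences of the english-yify link (the sample is made up for illustration). It also shows that switching to the lazy .+? only shortens the end of the match: the match still begins at the first /subtitles/.

```python
import re

# Shortened sample modelled on the question; it ends with an unterminated
# href, as the pasted string does.
text = (
    '<a href="/subtitles/blockers2018720pwebripx264-ytsam-arabic-128849">subtitle</a>'
    '<a href="/subtitles/blockers-arabic-yify-128849" class="subtitle-download">download</a>'
    '<a href="/subtitles/blockers2018web-dlx264-fgt-english-128543">subtitle</a>'
    '<a href="/subtitles/blockers-english-yify-128543" class="subtitle-download">download</a>'
    '<a href="/subtitles/blockers-english-yify-128543'
)

# Greedy: .+ runs to the end of the string, then backtracks to the LAST
# "-english-yify-" it can find.
greedy = re.search(r'/subtitles/.+-english-yify-\d+', text).group()

# Lazy: .+? stops at the FIRST "-english-yify-", but the match still starts
# at the first "/subtitles/", so it spans several links.
lazy = re.search(r'/subtitles/.+?-english-yify-\d+', text).group()

print(len(greedy), len(lazy))
print(lazy.startswith('/subtitles/blockers2018720pwebripx264'))  # True
```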

nosklo

Use `.+?` instead for a non-greedy quantifier. https://docs.python.org/3/library/re.html#regular-expression-syntax – Håken Lid Jun 27 '18 at 15:58
I tried that and it is not working; please check my regex: https://regex101.com/r/kyzg1J/4 – Pyd Jun 27 '18 at 15:59
Use something like `\w` instead of `.`, to avoid matching spaces and `"` etc. https://regex101.com/r/LL4zAq/2 – Håken Lid Jun 27 '18 at 16:06
This string is similar to that one but does not match: https://regex101.com/r/LL4zAq/3 (same regex, different movie name) – Pyd Jun 27 '18 at 16:25
@HåkenLid can you take a look? – Pyd Jun 27 '18 at 16:42
A dash `-` is not considered a "word character" and is not matched by `\w`. You can use `[^\s\"]` instead. https://regex101.com/r/LL4zAq/4 – Håken Lid Jun 27 '18 at 16:48
Be advised that it's quite possible to come across valid HTML that can't be parsed by a regular expression, for example if there are HTML comments, if there are URLs that are not wrapped in `"` quotes, or if the URLs contain unexpected characters. Libraries such as BeautifulSoup and Scrapy can handle all sorts of input data, including malformed HTML. See this answer for more details: https://stackoverflow.com/a/1732454/1977847 – Håken Lid Jun 27 '18 at 16:59
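
The character-class suggestion from the comments can be sketched as follows, assuming every href is wrapped in double quotes (the caveat raised above still applies). The sample string is a shortened, made-up version of the one in the question.

```python
import re

# Shortened sample modelled on the question.
text = (
    '<a href="/subtitles/blockers2018720pwebripx264-ytsam-arabic-128849">subtitle</a>'
    '<a href="/subtitles/blockers-arabic-yify-128849" class="subtitle-download">download</a>'
    '<a href="/subtitles/blockers2018web-dlx264-fgt-english-128543">subtitle</a>'
    '<a href="/subtitles/blockers-english-yify-128543" class="subtitle-download">download</a>'
)

# [^\s"]+ cannot cross a double quote or whitespace, so a match can never
# extend past the attribute value it started in.
pattern = re.compile(r'/subtitles/[^\s"]+-english-yify-\d+')

match = pattern.search(text)
link = match.group() if match else None
print(link)  # /subtitles/blockers-english-yify-128543
```

Restricting what the middle of the pattern may consume is what fixes the result here; greedy versus lazy matching no longer matters, because neither form can leave the quoted value.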
