Finding a term inside a bracket in html</a></h1> </div> <div class="grid fw-wrap pb8 mb16 bb bc-black-075"> <div class="grid--cell ws-nowrap mr16 mb8" title="2016-01-12 19:07:53Z"> <span class="fc-light mr2">Asked</span> <time itemprop="dateCreated" datetime="2014-05-14T13:50:56.757" class="fromnow">May 14 '14 at 13:50</time> </div> <div class="grid--cell ws-nowrap mr16 mb8"> <span class="fc-light mr2">Active</span> <time class="fromnow" title="2014-05-14T14:06:59.540" datetime="2014-05-14T14:06:59.540">May 14 '14 at 14:06</a> </div> <div class="grid--cell ws-nowrap mb8" title="Viewed 79 times"> <span class="fc-light mr2">Viewed</span> 79 times </div> </div> <div id="mainbar" role="main" aria-label="questions and answers"> <div id="question" class="question" data-questionid="23656623" data-ownerid="3601725" data-score="0"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="23656623"> <button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button> <div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="0">0</div> <button class="js-bookmark-btn s-btn s-btn__unset c-pointer py4"> <svg aria-hidden="true" class="svg-icon iconBookmark" width="18" height="18" viewBox="0 0 18 18"><path d="M6 1a2 2 0 00-2 2v14l5-4 5 4V3a2 2 0 00-2-2H6zm3.9 3.83h2.9l-2.35 1.7.9 2.77L9 7.59l-2.35 1.7.9-2.76-2.35-1.7h2.9L9 2.06l.9 2.77z"></path></svg> <div class="js-bookmark-count mt4" data-value=""></div> </button> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"><p>I am trying to find a specific string that contains a keyword inside a title tag in html e.g.</p> <pre><code><title>Bla bla bla String 
bla bla</title> </code></pre> <p>I am unsure how to construct that beyond the starting:</p> <pre><code>\<title\>(Word Keyword)\<\/title\> </code></pre> <p>I also want to make sure if I use any wildcards regex may be able to use that the wildcard between the keyword and the doesn't inadvertently go all the way to the end of perhaps another title block in the html.</p> <p>Lastly I'm trying to find a way to then<br/></p> <ul> <li>extract the Word Keyword only even though I've capture the entire regex</li> <li>extract/keep the separately.</li> </ul> <p>This is because I'll have several types of to captiure from and I want to extract both the 'Word Keyword' and the tag name it came from. Is this possible? I've looked into named groups but not sure if/how to extract after e.g.</p> <pre><code>(?P<TAG>(\<title\>|\<head\>)(?P<TERM>(Word Keyword))\<\/title\> </code></pre> <p>Naturally with any wildcard code as needed to make the above work but assuming it does I'd then want to be able to extract, after matching the string:</p> <ul> <li>title</li> <li>Bla Keyword</li> </ul> <p>or</p> <ul> <li>head</li> <li>Yada Keyword</li> </ul></div> <div class="mt24 mb12"> <div class="post-taglist grid gs4 gsy fd-column"> <div class="grid ps-relative"> <a href="../../questions/tagged/regex" class="post-tag js-gps-track" title="show questions tagged 'regex'" rel="tag">regex</a> </div> </div> </div> <div class="mb0"> <div class="mt16 grid gs8 gsy fw-wrap jc-end ai-start pt4 mb16"> <div class="grid--cell mr16 fl1 w96"></div> <div class="post-signature owner grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="asked May 14 '14 at 13:50">asked May 14 '14 at 13:50</time> <a href="../../users/3601725/user3601725" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" src="../../users/profiles/3601725.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="user3601725" /> </a> <div class="s-user-card--info"> 
<a href="../../users/3601725/user3601725" class="s-user-card--link">user3601725</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">473</li> <li class="s-award-bling s-award-bling__gold" title="1 gold badge">1</li> <li class="s-award-bling s-award-bling__silver" title="4 silver badge">4</li> <li class="s-award-bling s-award-bling__bronze" title="8 bronze badge">8</li> </ul> </div> </div> </div> </div> </div> </div> <div class="post-layout--right js-post-comments-component"> <div id="comments-23656623" class="comments js-comments-container bt bc-black-075 mt12 " data-post-id="23656623" data-min-length="15"> <ul class="comments-list js-comments-list" data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true"> <li id="comment-36335345" class="comment js-comment " data-comment-id="36335345" data-comment-owner-id="499214" data-comment-score="1"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> <span title="number of 'useful comment' votes received" class="warm">1</span> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment36335345_23656623"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">don't use regex to parse HTML...</span> – <a href="../../users/499214/john-dvorak" title="26,799 reputation" class="comment-user ">John Dvorak</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/23656623/finding-a-term-inside-a-title-bracket-in-html#comment36335345_23656623"><span title="2014-05-14T13:53:26.623 License: CC BY-SA 3.0" class="relativetime-clean">May 14 '14 at 13:53</span></a></span> </div> </div> </li> <li id="comment-36335543" class="comment js-comment " data-comment-id="36335543" data-comment-owner-id="1436981" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score 
js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment36335543_23656623"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">I guess using XPath would be much easier... `//title`</span> – <a href="../../users/1436981/stuxnet" title="4,309 reputation" class="comment-user ">stuXnet</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/23656623/finding-a-term-inside-a-title-bracket-in-html#comment36335543_23656623"><span title="2014-05-14T13:57:37.570 License: CC BY-SA 3.0" class="relativetime-clean">May 14 '14 at 13:57</span></a></span> </div> </div> </li> <li id="comment-36335649" class="comment js-comment " data-comment-id="36335649" data-comment-owner-id="8454" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment36335649_23656623"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">**Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. 
See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged.</span> – <a href="../../users/8454/andy-lester" title="91,102 reputation" class="comment-user ">Andy Lester</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/23656623/finding-a-term-inside-a-title-bracket-in-html#comment36335649_23656623"><span title="2014-05-14T13:59:47.203 License: CC BY-SA 3.0" class="relativetime-clean">May 14 '14 at 13:59</span></a></span> </div> </div> </li> <li id="comment-36336389" class="comment js-comment " data-comment-id="36336389" data-comment-owner-id="3601725" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment36336389_23656623"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">Hi, in fact we parse/clean the HTML in the main step and simply look for string patterns to identify text. However, one of the many patterns that would identify the text we are looking for is &lt;title&gt;. So I'm not counting on every HTML page to look or be the same at all, and in fact I am using regex to parse. In this case I am backing up a step to use the original HTML as another way of gathering the desired text string. – user3601725 May 14 '14 at 14:15

1 Answer

<(title|head).*?>(.*?)<\/\1>


This regex captures the tag name in its first match group and the inner HTML of the tag in its second group. Still, consider going with XPath or any HTML/XML parser instead, because of Zalgo.
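As a rough illustration of reading both groups back out, here is a minimal Python sketch. It is a variation on the expression above, not the exact same pattern: it uses named groups, swaps `.*?>` for `\b[^>]*>` so that a short tag name cannot match a longer one, and enables `re.DOTALL` so a title that wraps across lines still matches. The tag list (`title`, `h1`), the `find_terms` helper and the sample input are assumptions made up for this example.

```python
import re

# Variation on the expression above, with named groups.
# The tag list (title, h1) is a hypothetical choice for this sketch.
TAG_PATTERN = re.compile(
    r"<(?P<tag>title|h1)\b[^>]*>(?P<inner>.*?)</(?P=tag)>",
    re.DOTALL | re.IGNORECASE,
)


def find_terms(html, keyword):
    """Return (tag, term) pairs, where term is the word before keyword plus keyword."""
    results = []
    for match in TAG_PATTERN.finditer(html):
        # Collapse newlines and repeated spaces inside the tag body.
        inner = " ".join(match.group("inner").split())
        # Grab the single word that precedes the keyword, if the keyword is present.
        term = re.search(r"\w+\s+" + re.escape(keyword), inner)
        if term:
            results.append((match.group("tag").lower(), term.group(0)))
    return results


# Made-up input: a title that wraps across lines, a heading, and a non-matching title.
SAMPLE = (
    "<title>Bla bla Bla Keyword\nbla bla</title>\n"
    '<h1 class="x">Yada Keyword here</h1>\n'
    "<title>No match here</title>"
)

for tag, term in find_terms(SAMPLE, "Keyword"):
    print(tag, "|", term)
```

Because the inner group is lazy and the closing tag is tied to the opening one through the backreference, a match stops at the first matching close tag rather than running on into a later block.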

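You need a Perl-compatible regex engine such as PCRE to use this expression, because it relies on non-greedy (lazy) quantifiers, which POSIX regular expressions do not support.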
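For comparison, the parser-based route recommended here and in the comments can be sketched with Python's standard-library `html.parser`. This is only one possible approach under stated assumptions: the `TagTextCollector` class, the `collect` helper and the default tag list are hypothetical names invented for the example, and nested occurrences of the wanted tags are deliberately ignored to keep it short.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text content of selected tags (hypothetical helper)."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.current = None   # wanted tag currently open, if any
        self.chunks = []      # text pieces seen inside the open tag
        self.found = []       # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Only start collecting if we are not already inside a wanted tag.
        if self.current is None and tag in self.wanted_tags:
            self.current = tag
            self.chunks = []

    def handle_data(self, data):
        if self.current is not None:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            # Normalise whitespace so wrapped lines become single spaces.
            text = " ".join("".join(self.chunks).split())
            self.found.append((tag, text))
            self.current = None


def collect(html, wanted_tags=("title", "h1")):
    parser = TagTextCollector(wanted_tags)
    parser.feed(html)
    parser.close()
    return parser.found


print(collect("<title>Bla Keyword\nbla</title><h1 class='x'>Yada <b>Keyword</b></h1>"))
```

The keyword check can then be an ordinary string or regex test on the collected text, which keeps the tag handling out of the pattern entirely.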

stuXnet