What is the safest way to extract &lt;title&gt; from an HTML file using XPath?</a></h1> </div> <div class="grid fw-wrap pb8 mb16 bb bc-black-075"> <div class="grid--cell ws-nowrap mr16 mb8" title="2016-01-12 19:07:53Z"> <span class="fc-light mr2">Asked</span> <time itemprop="dateCreated" datetime="2010-08-18T01:20:36.637" class="fromnow">Aug 18 '10 at 01:20</time> </div> <div class="grid--cell ws-nowrap mr16 mb8"> <span class="fc-light mr2">Active</span> <time class="fromnow" title="2010-08-18T08:13:15.570" datetime="2010-08-18T08:13:15.570">Aug 18 '10 at 08:13</time> </div> <div class="grid--cell ws-nowrap mb8" title="Viewed 15,582 times"> <span class="fc-light mr2">Viewed</span> 1.6k times </div> </div> <div id="mainbar" role="main" aria-label="questions and answers"> <div id="question" class="question" data-questionid="3508281" data-ownerid="164168" data-score="6"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="3508281"> <button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button> <div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="6">6</div> <button class="js-bookmark-btn s-btn s-btn__unset c-pointer py4"> <svg aria-hidden="true" class="svg-icon iconBookmark" width="18" height="18" viewBox="0 0 18 18"><path d="M6 1a2 2 0 00-2 2v14l5-4 5 4V3a2 2 0 00-2-2H6zm3.9 3.83h2.9l-2.35 1.7.9 2.77L9 7.59l-2.35 1.7.9-2.76-2.35-1.7h2.9L9 2.06l.9 2.77z"></path></svg> <div class="js-bookmark-count mt4" data-value=""></div> </button> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"><p>Here is my current XPath expression: <code>"/html/head/title"</code>.</p> <p>But you know, in the real world html 
environment, the markup is often broken; for example, a missing <code>&lt;html&gt;</code> tag could cause an exception. So I would like to know whether there is a safe way to extract the <code>&lt;title&gt;</code> element (something like <code>getElementsByTagName</code>).</p></div> <div class="mt24 mb12"> <div class="post-taglist grid gs4 gsy fd-column"> <div class="grid ps-relative"> <a href="../../questions/tagged/html" class="post-tag js-gps-track" title="show questions tagged 'html'" rel="tag">html</a> <a href="../../questions/tagged/xpath" class="post-tag js-gps-track" title="show questions tagged 'xpath'" rel="tag">xpath</a> </div> </div> </div> <div class="mb0"> <div class="mt16 grid gs8 gsy fw-wrap jc-end ai-start pt4 mb16"> <div class="grid--cell mr16 fl1 w96"></div> <div class="post-signature owner grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="asked Aug 18 '10 at 01:20">asked Aug 18 '10 at 01:20</time> <a href="../../users/164168/silent" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" src="../../users/profiles/164168.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="silent" /> </a> <div class="s-user-card--info"> <a href="../../users/164168/silent" class="s-user-card--link">silent</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">3,964</li> <li class="s-award-bling s-award-bling__gold" title="5 gold badges">5</li> <li class="s-award-bling s-award-bling__silver" title="27 silver badges">27</li> <li class="s-award-bling s-award-bling__bronze" title="29 bronze badges">29</li> </ul> </div> </div> </div> </div> </div> </div> <div class="post-layout--right js-post-comments-component"> <div id="comments-3508281" class="comments js-comments-container bt bc-black-075 mt12 " data-post-id="3508281" data-min-length="15"> <ul class="comments-list js-comments-list" data-remaining-comments-count="0" data-canpost="false" data-cansee="true" 
data-comments-unavailable="false" data-addlink-disabled="true"> <li id="comment-3666841" class="comment js-comment " data-comment-id="3666841" data-comment-owner-id="313121" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment3666841_3508281"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">Where is the code running, in the browser or elsewhere?</span> – <a href="../../users/313121/tahbaza" title="9,486 reputation" class="comment-user ">Tahbaza</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/3508281/what-is-the-safest-way-to-extract-title-from-an-html-file-using-xpath#comment3666841_3508281"><span title="2010-08-18T01:25:17.437 License: CC BY-SA 2.5" class="relativetime-clean">Aug 18 '10 at 01:25</span></a></span> </div> </div> </li> <li id="comment-3666856" class="comment js-comment " data-comment-id="3666856" data-comment-owner-id="407664" data-comment-score="3"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> <span title="number of 'useful comment' votes received" class="warm">3</span> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment3666856_3508281"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">I'm not sure if it's a good idea to assume that an HTML page that is missing the `&lt;html&gt;` tag is also well-formed enough to be searchable through XPath.</span> – <a href="../../users/407664/mhmmd" title="1,483 reputation" class="comment-user ">Mhmmd</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/3508281/what-is-the-safest-way-to-extract-title-from-an-html-file-using-xpath#comment3666856_3508281"><span title="2010-08-18T01:28:45.413 License: CC BY-SA 2.5" class="relativetime-clean">Aug 18 '10 at 
01:28</span></a></span> </div> </div> </li> </ul> </div> </div> </div> </div> <div id="answers"> <a name="tab-top"></a> <div id="answers-header"> <div class="answers-subheader grid ai-center mb8"> <div class="grid--cell fl1"> <h2 class="mb0" data-answercount="9">5 Answers<span style="display:none;" itemprop="answerCount">5</span></h2> </div> </div> </div> <a name="3508296"></a> <div id="answer-3508296" class="answer accepted-answer" data-answerid="3508296" data-ownerid="145190" data-score="10" itemprop="acceptedAnswer" itemscope="" itemtype="https://schema.org/Answer"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="3508296"> <button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button> <div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="10">10</div> <div class="js-accepted-answer-indicator grid--cell fc-green-500 py6 mtn8"><div class="ta-center"><svg aria-hidden="true" class="svg-icon iconCheckmarkLg" width="36" height="36" viewBox="0 0 36 36"><path d="m6 14 8 8L30 6v8L14 30l-8-8v-8z"></path></svg></div></div> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"><p><code>"//title"</code> perhaps?</p></div> <div class="mb0"> <div class="mt16 grid gs8 gsy fw-wrap jc-end ai-start pt4 mb16"> <div class="grid--cell mr16 fl1 w96"></div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="answered Aug 18 '10 at 01:25">answered Aug 18 '10 at 01:25</time> <a href="../../users/145190/meder-omuraliev" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" 
src="../../users/profiles/145190.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="meder omuraliev" /> </a> <div class="s-user-card--info"> <a href="../../users/145190/meder-omuraliev" class="s-user-card--link">meder omuraliev</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">183,342</li> <li class="s-award-bling s-award-bling__gold" title="71 gold badges">71</li> <li class="s-award-bling s-award-bling__silver" title="393 silver badges">393</li> <li class="s-award-bling s-award-bling__bronze" title="434 bronze badges">434</li> </ul> </div> </div> </div> </div> </div> </div> <div class="post-layout--right js-post-comments-component"> </div> </div> </div> <a name="3508297"></a> <div id="answer-3508297" class="answer " data-answerid="3508297" data-ownerid="47550" data-score="3" itemprop="suggestedAnswer" itemscope="" itemtype="https://schema.org/Answer"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="3508297"> <button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button> <div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="3">3</div> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"><p>Because of the unruly nature of html markup you should use an html parsing library. 
You didn't specify a platform or language but there are a number of <a class="external-link" href="http://www.google.com/search?hl=en&rlz=1C1SNNT_enUS377US377&q=open+source+html+parsing+library&aq=f&aqi=m1&aql=&oq=&gs_rfai=" rel="nofollow noreferrer">open source libraries out there.</a></p></div> <div class="mb0"> <div class="mt16 grid gs8 gsy fw-wrap jc-end ai-start pt4 mb16"> <div class="grid--cell mr16 fl1 w96"></div> <div class="post-signature grid--cell"> <div class="user-info "> <div class="user-action-time">edited <span title="2010-08-18T01:43:23.507" class="relativetime">Aug 18 '10 at 01:43</span></div> <div class="user-gravatar32"></div> <div class="user-details" itemprop="author" itemscope="" itemtype="http://schema.org/Person"> <span class="d-none" itemprop="name">Paul Sasik</span> <div class="-flair"></div> </div> </div> </div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="answered Aug 18 '10 at 01:25">answered Aug 18 '10 at 01:25</time> <a href="../../users/47550/paul-sasik" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" src="../../users/profiles/47550.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="Paul Sasik" /> </a> <div class="s-user-card--info"> <a href="../../users/47550/paul-sasik" class="s-user-card--link">Paul Sasik</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">79,492</li> <li class="s-award-bling s-award-bling__gold" title="20 gold badges">20</li> <li class="s-award-bling s-award-bling__silver" title="149 silver badges">149</li> <li class="s-award-bling s-award-bling__bronze" title="189 bronze badges">189</li> </ul> </div> </div> </div> </div> </div> </div> <div class="post-layout--right js-post-comments-component"> <div id="comments-3508297" class="comments js-comments-container bt bc-black-075 mt12 " data-post-id="3508297" data-min-length="15"> <ul 
class="comments-list js-comments-list" data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true"> <li id="comment-3666855" class="comment js-comment " data-comment-id="3666855" data-comment-owner-id="47773" data-comment-score="1"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> <span title="number of 'useful comment' votes received" class="warm">1</span> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment3666855_3508297"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">You can use XPath *with* an HTML parsing library. Html Agility Pack is just one example, of many, that supports both.</span> – <a href="../../users/47773/matthew-flaschen" title="278,309 reputation" class="comment-user ">Matthew Flaschen</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/3508281/what-is-the-safest-way-to-extract-title-from-an-html-file-using-xpath#comment3666855_3508297"><span title="2010-08-18T01:28:35.070 License: CC BY-SA 2.5" class="relativetime-clean">Aug 18 '10 at 01:28</span></a></span> </div> </div> </li> <li id="comment-3666875" class="comment js-comment " data-comment-id="3666875" data-comment-owner-id="47550" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment3666875_3508297"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">@Matthew: Good point. 
I qualified the xpath statement in my answer.</span> – <a href="../../users/47550/paul-sasik" title="79,492 reputation" class="comment-user ">Paul Sasik</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/3508281/what-is-the-safest-way-to-extract-title-from-an-html-file-using-xpath#comment3666875_3508297"><span title="2010-08-18T01:31:48.400 License: CC BY-SA 2.5" class="relativetime-clean">Aug 18 '10 at 01:31</span></a></span> </div> </div> </li> <li id="comment-3666890" class="comment js-comment " data-comment-id="3666890" data-comment-owner-id="47773" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment3666890_3508297"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">I don't get what "attempting xpath [...] directly on the markup" means. XPath requires the markup is already parsed to a DOM.</span> – <a href="../../users/47773/matthew-flaschen" title="278,309 reputation" class="comment-user ">Matthew Flaschen</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/3508281/what-is-the-safest-way-to-extract-title-from-an-html-file-using-xpath#comment3666890_3508297"><span title="2010-08-18T01:35:23.680 License: CC BY-SA 2.5" class="relativetime-clean">Aug 18 '10 at 01:35</span></a></span> </div> </div> </li> <li id="comment-3666942" class="comment js-comment " data-comment-id="3666942" data-comment-owner-id="47550" data-comment-score="0"> <div class="js-comment-actions comment-actions"> <div class="comment-score js-comment-edit-hide"> </div> </div> <div class="comment-text js-comment-text-and-form"> <a name="comment3666942_3508297"></a> <div class="comment-body js-comment-edit-hide"> <span class="comment-copy">@Matthew: Fair enough. I was making assumptions (such as some HTML->XML process) with very little context. 
Paring down the answer to just suggest the use of a library, which I suppose I'm assuming is not being used.</span> – <a href="../../users/47550/paul-sasik" title="79,492 reputation" class="comment-user ">Paul Sasik</a> <span class="comment-date" dir="ltr"><a class="comment-link" href="../../questions/3508281/what-is-the-safest-way-to-extract-title-from-an-html-file-using-xpath#comment3666942_3508297"><span title="2010-08-18T01:45:42.987 License: CC BY-SA 2.5" class="relativetime-clean">Aug 18 '10 at 01:45</span></a></span> </div> </div> </li> </ul> </div> </div> </div> </div> <a name="3510125"></a> <div id="answer-3510125" class="answer " data-answerid="3510125" data-ownerid="42585" data-score="3" itemprop="suggestedAnswer" itemscope="" itemtype="https://schema.org/Answer"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="3510125"> <button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button> <div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="3">3</div> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"><p>Actually <code>/html/head/title</code> should work just fine, even on badly malformed mark-up, assuming: </p> <ul> <li>there is a title element; </li> <li>your HTML parser behaves the same way browser parsers do; </li> <li>your HTML parser puts the HTML elements into the null namespace.</li> </ul> <p>You will have to allow for the possibility of there being multiple title elements in invalid HTML, so <code>/html/head/title[1]</code> is possibly better.</p></div> <div class="mb0"> <div class="mt16 grid gs8 gsy fw-wrap jc-end ai-start pt4 mb16"> <div 
class="grid--cell mr16 fl1 w96"></div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="answered Aug 18 '10 at 08:13">answered Aug 18 '10 at 08:13</time> <a href="../../users/42585/alohci" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" src="../../users/profiles/42585.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="Alohci" /> </a> <div class="s-user-card--info"> <a href="../../users/42585/alohci" class="s-user-card--link">Alohci</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">78,296</li> <li class="s-award-bling s-award-bling__gold" title="16 gold badges">16</li> <li class="s-award-bling s-award-bling__silver" title="112 silver badges">112</li> <li class="s-award-bling s-award-bling__bronze" title="156 bronze badges">156</li> </ul> </div> </div> </div> </div> </div> </div> <div class="post-layout--right js-post-comments-component"> </div> </div> </div> <a name="3508306"></a> <div id="answer-3508306" class="answer " data-answerid="3508306" data-ownerid="411247" data-score="1" itemprop="suggestedAnswer" itemscope="" itemtype="https://schema.org/Answer"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="3508306"> <button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button> <div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="1">1</div> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"><p>If the code runs in a browser and you can use JavaScript, you can read the title directly:</p> <pre><code>document.title </code></pre></div> <div class="mb0"> 
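<p>A minimal sketch, not part of the original answer, of how the <code>//title</code> suggestion and <code>document.title</code> could be combined in browser code. The helper name <code>getTitle</code> and its <code>doc</code> parameter are hypothetical; it assumes <code>doc</code> is a browser <code>Document</code> (or an object exposing the same <code>evaluate</code> method and <code>title</code> property).</p>

```javascript
// Hypothetical helper for illustration only; the name and shape are assumptions.
// Numeric value of XPathResult.STRING_TYPE, spelled out so the function does
// not depend on the XPathResult global being defined.
const XPATH_STRING_TYPE = 2;

function getTitle(doc) {
  if (typeof doc.evaluate === "function") {
    try {
      // string(//title) gives the text of the first title element in
      // document order, or an empty string when no title element exists.
      const result = doc.evaluate("string(//title)", doc, null, XPATH_STRING_TYPE, null);
      const text = result.stringValue.trim();
      if (text !== "") {
        return text;
      }
    } catch (err) {
      // XPath evaluation failed; fall through to the property lookup below.
    }
  }
  // Fall back to the title property, or an empty string if that is absent.
  return typeof doc.title === "string" ? doc.title.trim() : "";
}
```

<p>In a browser this would be called as <code>getTitle(document)</code>. Because the browser's HTML parser has already repaired the markup by the time the script runs, a missing <code>html</code> tag in the source should not matter here.</p>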
<div class="mt16 grid gs8 gsy fw-wrap jc-end ai-start pt4 mb16"> <div class="grid--cell mr16 fl1 w96"></div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="answered Aug 18 '10 at 01:26">answered Aug 18 '10 at 01:26</time> <a href="../../users/411247/topera" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" src="../../users/profiles/411247.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="Topera" /> </a> <div class="s-user-card--info"> <a href="../../users/411247/topera" class="s-user-card--link">Topera</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">12,223</li> <li class="s-award-bling s-award-bling__gold" title="15 gold badges">15</li> <li class="s-award-bling s-award-bling__silver" title="67 silver badges">67</li> <li class="s-award-bling s-award-bling__bronze" title="104 bronze badges">104</li> </ul> </div> </div> </div> </div> </div> </div> <div class="post-layout--right js-post-comments-component"> </div> </div> </div> <a name="3508301"></a> <div id="answer-3508301" class="answer " data-answerid="3508301" data-ownerid="279130" data-score="0" itemprop="suggestedAnswer" itemscope="" itemtype="https://schema.org/Answer"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="3508301"> <button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button> <div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="0">0</div> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"><p>If you have something that an XML 
parser can parse (which is not the case with most HTML, but needs to be the case to use XPath), then you could use <code>//title</code> to get the element.</p></div> <div class="mb0"> <div class="mt16 grid gs8 gsy fw-wrap jc-end ai-start pt4 mb16"> <div class="grid--cell mr16 fl1 w96"></div> <div class="post-signature grid--cell"> <div class="s-user-card s-user-card"> <time class="s-user-card--time" datetime="answered Aug 18 '10 at 01:26">answered Aug 18 '10 at 01:26</time> <a href="../../users/279130/jwismar" class="s-avatar s-avatar__32 s-user-card--avatar"> <img class="s-avatar--image" src="../../users/profiles/279130.webp" data-jdenticon-width="32" data-jdenticon-height="32" data-jdenticon-value="jwismar" /> </a> <div class="s-user-card--info"> <a href="../../users/279130/jwismar" class="s-user-card--link">jwismar</a> <ul class="s-user-card--awards"> <li class="s-user-card--rep" title="reputation score">12,164</li> <li class="s-award-bling s-award-bling__gold" title="3 gold badges">3</li> <li class="s-award-bling s-award-bling__silver" title="32 silver badges">32</li> <li class="s-award-bling s-award-bling__bronze" title="44 bronze badges">44</li> </ul> </div> </div> </div> </div> </div> </div> <div class="post-layout--right js-post-comments-component"> </div> </div> </div> </div> </div> <div id="sidebar" class="show-votes" role="complementary" aria-label="sidebar"> <div class="module sidebar-linked"> <h4 id="h-linked">Linked</h4> <div class="linked"> <div class="spacer"> <a title="Vote score (upvotes - downvotes)"><div class="answer-votes default">0</div></a> <a href="../../questions/24600911/my-importxml-xpath-soundcloud-playlist-not-working" class="question-hyperlink">My importXML + xPath + Soundcloud playlist Not Working</a> </div> </div> </div> </div> </div> </div> <script src="../../static/js/stack-icons.js"></script> <script src="../../static/js/fromnow.js"></script> </body> </html>