I would like to know how to remove a single HTML element from a string of HTML.

Here is an example of a string in JavaScript:

var str = `<ol class="breadcrumb">
  <li>1</li>
  <li>2</li>
</ol>
<div>body</div>
<ol>
  <li>a</li>
</ol>`;

I would like to remove the first <ol>. Here is what I have tried:

str.replace(/\<ol class=\"breadcrumb".*\<\/ol\>/, '');

But this solution will delete everything in the string.

I have a hunch that, rather than using .* in the above solution, I should match everything except </ol>. How can I do that, and is there an alternative solution?

Edit

I am writing a web crawler and would like to parse an HTML string.

mc9
  • Incoming [don't parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1) comments. – Marty Aug 09 '16 at 00:51
  • HTML is not a regular language thus can't be parsed by regular expressions. – Derek 朕會功夫 Aug 09 '16 at 00:52
  • Use JavaScript, jQuery or some DOM parser. Your life will be much simpler. – Robert Aug 09 '16 at 00:54
  • @Marty - Tony the Pony alert – Jaromanda X Aug 09 '16 at 00:55
  • @Derek朕會功夫 I am writing a crawler and would like to process the data by using regex. Is this a valid use case, or could you suggest an alternative approach? – mc9 Aug 09 '16 at 00:56
  • @SungWonCho RegExp is mathematically insufficient to parse HTML. Use [`DOMParser`](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) instead. – Derek 朕會功夫 Aug 09 '16 at 00:59
  • @Derek朕會功夫 Thanks. Since I am on the server side, I will use an alternative library to parse and manipulate DOM. Feel free to add it as an answer so that people who are googling this can know about it. – mc9 Aug 09 '16 at 01:00
  • [Here you go](https://github.com/tmpvar/jsdom). – Marty Aug 09 '16 at 01:01
  • If there were an <ol> tag inside another <ol> tag, a regex could not correctly find the matching end tag. This is a limitation of regex. You would need some additional logic in JavaScript. – kyasbal Aug 09 '16 at 01:01
  • Or is there some constraint on your HTML files that makes them parseable with regex? If you want to parse whole HTML tags correctly with regex, no tag can be nested inside a tag of the same name, because any language that requires counting tags is not a regular language. – kyasbal Aug 09 '16 at 01:07
  • I cannot add an answer since this has been marked as a duplicate. Your specific case would be solved by putting a question mark after the star: .*? This makes the search non-greedy, so it stops as soon as possible instead of taking as much as possible. – Froziph Aug 09 '16 at 01:11
  • @Froziph and you aren't even going to mention the issue with nesting? – Peter R Aug 09 '16 at 17:11
  • @Peter-R They are not nested in his example. I said it worked for his specific case. Using regex for html isn't great, I believe all the other posts made that abundantly clear. I just wanted to introduce him to the ? operator in case he didn't know it. – Froziph Aug 09 '16 at 20:55
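
For readers who want to see the non-greedy suggestion from the comments in action, here is a minimal sketch using the string from the question. It assumes the breadcrumb list contains no nested <ol>, as in the example. The variable names `greedy` and `lazy` are only for illustration, and `[\s\S]` is substituted for `.` because `.` does not match newlines by default. For anything beyond this narrow case, a DOM parser such as the jsdom library linked above is probably the safer route.

```javascript
// The HTML string from the question, written as a template literal so it can
// span several lines and contain double quotes.
const str = `<ol class="breadcrumb">
  <li>1</li>
  <li>2</li>
</ol>
<div>body</div>
<ol>
  <li>a</li>
</ol>`;

// Greedy: [\s\S]* runs on to the LAST </ol>, so the whole string is removed.
const greedy = str.replace(/<ol class="breadcrumb">[\s\S]*<\/ol>/, '');

// Non-greedy: [\s\S]*? stops at the FIRST </ol>, so only the breadcrumb
// list is removed. This breaks as soon as an <ol> is nested inside it.
const lazy = str.replace(/<ol class="breadcrumb">[\s\S]*?<\/ol>/, '');

// replace() returns a new string; str itself is left unchanged.
console.log(JSON.stringify(greedy));
console.log(lazy.trim());
```

Note that `replace()` does not modify `str`, so the result has to be assigned to a variable to be used.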
