
I have written a regex to find the attributes and values of a div tag, but it has the following issues. I am looking for guidance and help.

  1. If an attribute value contains white space, the match breaks, e.g. attr="a s"
  2. If an attribute value is wrapped in ' instead of ", the match breaks, e.g. attr='as'
  3. If an attribute value contains a line break, the match breaks

Here is my regex:

/\b([^\s]+)(="(^'|^"|[^\s]+)*")/ig

The regex above has 3 groups:

1st group result = attribute key name

2nd group result = value after =

3rd group result = attribute value without "

Sample HTML tag:

<div json="{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":"value5","key6":"value6","key7":"value7","key8":"value8"}" 
data='sdaasd' 
data-role=""
key="somekey">

Sample HTML tag with spaces:

<div json="{"key1":"value1","key2":  {"key3":"value3",  "key4":"value4","key5":"value5","key6":"value6","key7":"value7","key8":"value8"}" 
data='sdaasd' 
data-role=""
key="somekey">
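
A minimal sketch of how the three issues above (white space, single quotes, line breaks) might be handled with a single pattern. It is not a full HTML parser: it assumes each value is wrapped in one kind of quote and does not contain that same quote unescaped, so the `json` attribute in the samples above, with its nested `"` characters, would still not be captured correctly. The helper name `parseAttributes` and the `title` and `note` attributes in the test tag are made up for illustration.

```javascript
// Key, optional spaces around "=", then either a double-quoted or a single-quoted value.
// [^"]* and [^']* also match spaces and line breaks, which covers issues 1 and 3.
const ATTR_RE = /([^\s=<>"'\/]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;

// Hypothetical helper: returns an object mapping attribute names to values.
function parseAttributes(tag) {
  const attrs = {};
  for (const m of tag.matchAll(ATTR_RE)) {
    // m[1] = key, m[2] = double-quoted value, m[3] = single-quoted value
    attrs[m[1]] = m[2] !== undefined ? m[2] : m[3];
  }
  return attrs;
}

// Illustrative tag: single quotes, an empty value, a space, and a line break.
const tag = `<div data='sdaasd'
data-role=""
title="a s"
note="line one
line two"
key="somekey">`;

console.log(parseAttributes(tag));
```

Using two alternatives, one per quote style, keeps each value from running past its own closing quote. A value that itself contains the wrapping quote has to be escaped (for example as `&quot;`) before a pattern like this can work.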
Elankeeran
  • Why don't you use JavaScript DOM api to parse tags as they're supposed to be parsed instead of using the inferior (and often wrong) method of regexes? – Marko Gresak Aug 15 '15 at 02:49
  • I am not using the DOM API because I get this tag as a string from the DB. I need to parse it from the string. – Elankeeran Aug 15 '15 at 02:53
  • [Obligatory](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) –  Aug 15 '15 at 02:54
  • You tagged your question as javascript. This means you can use some kind of DOM parsing, which is better than regexes (which are, as mentioned in @TinyGiant's link, *not* meant for parsing the DOM!). If your JS is not running in the browser, you could still use a library like jsdom, cheerio etc. Just please, for the sake of your sanity, don't try to parse any inconsistent HTML with a regex. – Marko Gresak Aug 15 '15 at 02:57
  • Thanks @TinyGiant, but I am getting this HTML as a string from the DB and I need to parse it to identify the attribute keys and values. I have not found anything other than regex. – Elankeeran Aug 15 '15 at 03:00
  • @MarkoGrešak I tagged javascript because I am using Node.js, and this string comes from MongoDB. – Elankeeran Aug 15 '15 at 03:01
  • [You can convert a string to dom then parse it as such.](http://jsfiddle.net/akvwd9op/) You should never, under any circumstances (whatsoever), try to use regex to parse html. That is what parsing engines are for. –  Aug 15 '15 at 03:01
  • @TinyGiant Good idea. But in Node.js, if I convert the string to a DOM using cheerio, won't that create a performance issue? Wouldn't it be better to process the string with a regex? Again, what is your suggestion? – Elankeeran Aug 15 '15 at 03:06
  • I understand what you're doing, but it's not like you're trying to do this parsing within MongoDB. If you are, please don't do it. In Node (for example an Express app), you can use what I've suggested in my previous comment. There is no situation, other than homework, where you are forced to use a regex to parse HTML, so please don't try to make it happen. Your problem is trivial to solve with any kind of HTML parser. And about performance: "premature optimization is the root of all evil" - Donald Knuth. Unless you're running it millions of times a minute, I wouldn't care about it. – Marko Gresak Aug 15 '15 at 03:09
  • My suggestion would be to not use regex for something that it should not be used for. Use the correct method of parsing an HTML string, which I have already suggested. Any reduction in performance because of this will be negligible. And I seriously doubt that you will get an actual regex answer worth its salt, because of the reasons in the link I posted previously. –  Aug 15 '15 at 03:10
  • Nice suggestions, @TinyGiant and @MarkoGresak, thanks! Let me parse the string to HTML, and then I will use cheerio on the Node side. – Elankeeran Aug 15 '15 at 03:17
