-1

I have looked at a bunch of examples on this site, but I still can't get this completely right. I am trying to grab only the text between > and <. Example string:

<div class='col-lg-12 hintDisplay'>slavery <b>ALSO USE</b> human trafficking</div>

First I did:

var regexp = />(.*?)</g;
var matches_array = item.toString().match(regexp);
console.log(matches_array);

and got:

>slavery <,>ALSO USE<,> human trafficking<

Then I read more and tried:

var regexp = /(>)(.*?)(?=<)/g;
var matches_array = item.toString().match(regexp);
console.log(matches_array);

and now:

>slavery ,>ALSO USE,> human trafficking

I couldn't find any documentation on how to get rid of the leading >. So how do I grab only the stuff in between > and <?

SedJ601
  • 4
    Why not parse the string as HTML and then use `jquery` to extract the content you need? In the case of regex, you need to remove the parentheses around `>`. `/>(.*?)(?=<)/g` so it's not captured. – Psidom Aug 29 '17 at 18:58
  • Thanks! I will give this a try when I get back to my desk. – SedJ601 Aug 29 '17 at 19:00
  • https://stackoverflow.com/questions/432493/how-do-you-access-the-matched-groups-in-a-javascript-regular-expression – Will Barnwell Aug 29 '17 at 20:37

2 Answers

2

In this case I like to do a regex like:

var regexp = />([^<]+)</;

This says: match a >, then at least one character that is not a <, then a <. The parentheses capture only the text in between.

Trying to use .*? usually leads to the kind of issues you are running into :)

https://regex101.com/r/UJrVWd/1
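As the comments below work out, `String.prototype.match` with the `g` flag returns the full matches (including the `>` and `<`), not the captured groups. Here is a minimal sketch of one way to read group 1 from every match by looping with `exec`. The helper names `captureBetween` and `captureWithLookbehind` are made up for illustration, and `item` is assumed to be the example string from the question.

```javascript
// Example string from the question (assumed to be what `item` holds).
const item = "<div class='col-lg-12 hintDisplay'>slavery <b>ALSO USE</b> human trafficking</div>";

// Hypothetical helper: collect capture group 1 from every match.
function captureBetween(text) {
  const regexp = />([^<]+)</g; // the g flag lets exec() advance through the string
  const results = [];
  let match;
  while ((match = regexp.exec(text)) !== null) {
    // match[0] is the full match, e.g. ">slavery <"
    // match[1] is only the captured text between > and <
    results.push(match[1]);
  }
  return results;
}

// Alternative sketch, assuming an engine with lookbehind support (ES2018+):
// the > and < are asserted but never consumed, so match() returns only the text.
function captureWithLookbehind(text) {
  return text.match(/(?<=>)[^<]+(?=<)/g) || [];
}

console.log(captureBetween(item));
console.log(captureWithLookbehind(item));
```

For the example string, both should log `slavery `, `ALSO USE` and ` human trafficking` (with the surrounding spaces preserved), so call `trim()` on each entry if you don't want the whitespace.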

sniperd
  • I am still getting `>slavery <,>ALSO USE<,> human trafficking< (15:20:27:481 | null)` using your solution. – SedJ601 Aug 29 '17 at 20:21
  • Without `g` `>slavery <,slavery (15:25:09:977 | null)`. Only the first match. – SedJ601 Aug 29 '17 at 20:27
  • check out the regex101 link, it looks like it's working. Maybe I'm missing something specific to javascript here – sniperd Aug 29 '17 at 20:31
  • Yea, it looks good in that link, but my `console.log()` output looks like the full match instead of the group result. – SedJ601 Aug 29 '17 at 20:34
  • 2
    OP, you are printing the full match instead of the captured group to console – Will Barnwell Aug 29 '17 at 20:35
  • Okay, how do I print the captured group? – SedJ601 Aug 29 '17 at 20:35
  • 1
    https://stackoverflow.com/questions/432493/how-do-you-access-the-matched-groups-in-a-javascript-regular-expression I think this does the trick or perhaps: https://stackoverflow.com/questions/1222045/how-to-loop-all-the-elements-that-match-the-regex – sniperd Aug 29 '17 at 20:35
  • Actually, I am printing all the content of the `Array` that's storing the `Matches`. – SedJ601 Aug 29 '17 at 20:37
  • I already tried the first link before posting this question and it didn't get the job done. The second link you posted is right on and exactly what I needed to make the capture group act the way it was designed to. Thanks. – SedJ601 Aug 29 '17 at 20:46
2

Well, in my opinion you should use the built-in HTML parser and use jQuery or something similar to get your text out of the HTML.

Some reasons why you shouldn't regex HTML can be found over here:

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML.

RegEx match open tags except HTML self-contained tags
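A rough sketch of what the parser route could look like without jQuery, assuming the code runs in a browser where `DOMParser` exists (it is not available in plain Node.js). The function names `collectText` and `textBetweenTags` are hypothetical, not from any library.

```javascript
// Recursively collect the value of every text node under `node`.
// Uses only standard DOM properties: nodeType, nodeValue, childNodes.
function collectText(node, out = []) {
  const TEXT_NODE = 3; // numeric value of Node.TEXT_NODE in the DOM standard
  if (node.nodeType === TEXT_NODE) {
    out.push(node.nodeValue);
  } else {
    for (const child of Array.from(node.childNodes || [])) {
      collectText(child, out);
    }
  }
  return out;
}

// Browser-only: parse the markup, then walk the resulting tree.
function textBetweenTags(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return collectText(doc.body);
}
```

In a browser, `textBetweenTags(item)` with the string from the question should give the same three pieces of text the regex was after, and it keeps working when attribute values happen to contain `>` or `<`.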

marpme
  • Thanks, I am going to use your advice. I am not going to select your answer as the correct answer because what if the next person's text is not `HTML`? This is the correct answer for my situation but not for the question. Once again, thanks! – SedJ601 Aug 29 '17 at 20:29
  • 1
    Your question is about html parsing using javascript, this is the correct answer and should be accepted as such. – Will Barnwell Aug 29 '17 at 20:39