Getting string value between two characters with a random string in the middle

Question

I have a few HTML lines like this:

<div class="itemA" attr="abc">VALUE I NEED TO GET</div>
<div class="itemA" data-attr="def">VALUE I NEED TO GET</div>
<div class="itemA" something-else="xyz">VALUE I NEED TO GET</div>
<div class="itemA" other="123">VALUE I NEED TO GET</div>
<div class="itemB">VALUE I DONT NEED TO GET</div>
<div class="itemB">VALUE I DONT NEED TO GET</div>

I know that the regular expression for getting the string between two characters looks like this:

(?<=[char1]).*?(?=[char2])

When I use this:

Regex.Matches([HTML_ABOVE], @"(?<=class=""itemA"")(.*?)(?=</div>)")

the result is:

attr="abc">VALUE I NEED TO GET
data-attr="def">VALUE I NEED TO GET
something-else="xyz">VALUE I NEED TO GET
other="123">VALUE I NEED TO GET

Is there any way to ignore or remove the leading characters (everything up to and including the >)?

I would google for an HTML parsing framework and use that ;) — Mighty Badaboom, Apr 27 '17 at 13:21
[Obligatory link](http://stackoverflow.com/a/1732454/2307070) about why ṫ̨̗̺̭̮̞̗̜̮̗̙̫̺̖̭̯͊ͨ̌͒̍͘͘͟͝h̸͓̩̙͙̻̗͔̞̘̟̩̯͋͑͂͐a̴̧ͨ́ͭ͒ͯ̓͐̇̃ͥ͢҉‌̨̳̜̤͍͖t̵̳̳͕͉͋̓͐ͦͬ̈́̀̚‌ is a bad idea — Thomas Ayoub, Apr 27 '17 at 13:25
As long as it is well-formed XML you could try to read it as XML — MiGro, Apr 27 '17 at 13:25
Or use `]+>(.*?)(?=<\/div>)` if you want your code to work most of the time, but not every time — Thomas Ayoub, Apr 27 '17 at 13:27
I use Regex because it is very quick to set up, and I only need a few values from the whole HTML page — S.A, Apr 27 '17 at 13:38
If you are parsing changeable HTML or have difficult HTML node scenarios, then yes, use the HTML Agility Pack; it is the way to go. But if you have static and sane HTML and can easily scoop up data via regex, use regex. People do not like to offer regex solutions because they haven't taken the time to learn pattern matching, and because of that they `vote` down any regex question on SO. — ΩmegaMan, Apr 27 '17 at 14:18
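
MiGro's suggestion to read the markup as XML could be sketched roughly as follows, using System.Xml.Linq from the base class library. The class and method names (XmlItemReader, GetValues) and the wrapping "root" element are assumptions made for illustration, and the approach only works while the markup really is well-formed XML:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlItemReader
{
    // Hypothetical helper: returns the text of every <div> whose
    // class attribute is exactly cssClass.
    public static List<string> GetValues(string fragment, string cssClass)
    {
        // Several top-level <div> elements do not form a valid XML document,
        // so wrap the fragment in a single root element before parsing.
        // XElement.Parse throws an XmlException if the markup is not well-formed.
        XElement root = XElement.Parse("<root>" + fragment + "</root>");

        return root.Descendants("div")
                   // The cast yields null when the attribute is missing.
                   .Where(div => (string)div.Attribute("class") == cssClass)
                   .Select(div => div.Value)
                   .ToList();
    }
}
```

Calling XmlItemReader.GetValues(html, "itemA") on the six sample lines should return only the itemA texts, because the filter is applied to the class attribute and the other attributes are never inspected.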

score 1 · Answer 1 · answered Apr 27 '17 at 13:26

If you search the NuGet Package Manager for HtmlAgilityPack, you will find a nice tool that does all the parsing for you. Then you do not need the regex.
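
A minimal sketch of what that might look like, assuming the HtmlAgilityPack NuGet package is installed. The class and method names (HapItemReader, GetValues) are made up for this example, and the XPath compares the whole class attribute, so it would miss elements that carry several classes:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // external NuGet package, assumed to be referenced

public static class HapItemReader
{
    // Hypothetical helper: returns the inner text of every <div>
    // whose class attribute is exactly cssClass.
    public static List<string> GetValues(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//div[@class='" + cssClass + "']");
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes.Select(node => node.InnerText).ToList();
    }
}
```

With the sample from the question, HapItemReader.GetValues(html, "itemA") should give the four itemA texts regardless of which extra attribute each div carries.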

Torben

ΩmegaMan · Accepted Answer · 2017-04-27T14:20:49.567

0

Use a negated set (the "not in set" construct, [^ ]+) to find the text instead. I would change the pattern to:

(?<=>)([^<]+)

which says to match, but not consume or capture, a >. Once that is found, consume all text that is not a <.

Because HTML text spans several lines, characters such as \r\n follow each closing tag and would yield blank matches. So I would add a negative lookahead, (?![\r\n]), to the pattern; the ?! means "do not match here if what follows is one of these characters", so a match cannot start on a line break.

(?<=>)(?![\r\n])([^<]+)

Here is my C# example

using System.Linq;
using System.Text.RegularExpressions;

string data = @"<div class=""item"" attr=""abc"">VALUE I NEED TO GET</div>
<div class=""item"" data-attr=""def"">VALUE I NEED TO GET</div>
<div class=""item"" something-else=""xyz"">VALUE I NEED TO GET</div>
<div class=""item"" other=""123"">VALUE I NEED TO GET</div>";

// Group 1 holds the text between a '>' and the next '<'.
var values = Regex.Matches(data, @"(?<=>)(?![\r\n])([^<]+)")
     .OfType<Match>()
     .Select(itm => itm.Groups[1].Value);

Which returns four matches, each with the value VALUE I NEED TO GET.

edited Apr 27 '17 at 14:20

answered Apr 27 '17 at 14:11

ΩmegaMan

Thanks, but I don't think it works if there are other tags inside the data; it will grab every other value that follows a > – S.A Apr 27 '17 at 15:20
@S.A True, but you didn't provide that as a situation which needed to be considered. Are there others? – ΩmegaMan Apr 27 '17 at 15:29
I updated the question to add more cases; I hope it makes my purpose clearer – S.A Apr 27 '17 at 16:35
@S.A Judging by your latest example, to get just the specific strings you would need to add more to the lookbehind `(?<= {...})`, but it should work. If the node acquisition via attributes gets more complicated, you may want to look into using the HTML Agility Pack. – ΩmegaMan Apr 27 '17 at 17:53
Yes, the {...} is exactly the part I don't know yet; I'm not sure what to put there so that the regex skips everything up to the ">" character – S.A Apr 27 '17 at 18:17
@S.A This seems to work for me `(?<=itemA.+?>)`. The `.+?` says match as little as possible. – ΩmegaMan Apr 27 '17 at 19:59
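
Putting the lookbehind from the last comment together with the rest of the pattern from the answer might look like the sketch below. The combination and the names (ItemARegex, GetValues) are assumptions for illustration, not something spelled out in the thread. It relies on .NET supporting variable-length lookbehind, on each div sitting on its own line, and on the wanted text containing no nested tags:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ItemARegex
{
    // (?<=itemA.+?>)  the position must be preceded by "itemA", at least one
    //                 more character, and a '>' ('.' never crosses a \n)
    // (?![\r\n])      do not start a match on a line break
    // ([^<]+)         capture everything up to the next '<'
    private static readonly Regex Pattern =
        new Regex(@"(?<=itemA.+?>)(?![\r\n])([^<]+)");

    // Hypothetical helper: returns group 1 of every match.
    public static List<string> GetValues(string html)
    {
        return Pattern.Matches(html)
                      .Cast<Match>()
                      .Select(m => m.Groups[1].Value)
                      .ToList();
    }
}
```

Because the dot does not match \n, the lookbehind cannot reach back to an itemA on an earlier line, which is what keeps the itemB lines out of the result. If the markup ever puts several divs on one line, this assumption breaks and a parser is the safer choice.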