Get sequences of occurrences in large string of html-text

Question

I am using fetch to get an HTML file. So far I've only figured out how to get the response back as one long string, using the text() method:

fetch(url, { credentials: 'same-origin' })
    .then(function(response) {
        return response.text();
    })
    .then(function(text) {
        longAssText = text;
        textExtract = longAssText.match(/<table class='listing' id='customer-tickets'>[\s\S]*<script type='text\/javascript'>/gi);
    });

The string I get back looks something like this (textExtract):

<span class="status status_active">active</span></td>
<td><a href="/tickets/365347-SOME-TITLE">#365347 SOME-TITLE</a></td>
<td>2018-03-12 09:14:34</td>
<td>2018-03-12 10:12:46</td>
<td>some category</td>
</tr>
<tr class='even'>
<td>
<img align="absmiddle" alt="Service_request_ticket" src="/images/service_request_ticket.gif?1520519528" title="some attribute" />
<img align="absmiddle" alt="Number_1" src="/images/number_1.gif?1520519528" title="Saken ligger hos 1. linje" />
<img align="absmiddle" alt="Flag_disabled" src="/images/flag_disabled.png?1520519528" title="Priority: Normal" />
</td>
<td class='ttstatus'><span class="status status_closed">closed</span></td>
<td><a href="/tickets/150640-vs-sender-e-post-brn001ba9bd7a93_000186">#150640 VS: SOME TITLE</a></td>
<td>2013-11-06 08:12:35</td>
<td>2013-11-20 21:00:11</td>
<td>Some category</td>
</tr>
<tr class='odd'>
<td>

I want to extract the text inside every a-tag that is preceded by a span with the status_active class, for example "#365347 SOME-TITLE".

So in:

<a href="/tickets/365347-SOME-TITLE">#365347 SOME-TITLE</a>

I want to extract #365347 SOME-TITLE.

That is, every a-tag that comes after a span.status_active.

I'm having a hard time with regex. I was thinking of getting all instances with regex, but I can't even get the first match.

I've tried patterns like from([\s\S]*?)to, but I'm really having a hard time wrapping my head around this.

The closest I've managed is:

(status_active)[^._]*(?=\.)

But not every text has a . at the end.

Is regex the way to go? If so could someone point me in the right direction?

If I understand, you want the text inside every a tag after a specific span, or only the first one? — Lance Toth, Mar 14 '18 at 13:31
Text inside every a-tag after a span.status_active. I've updated the question to reflect this in a better way. — Michael Krøyserth-Simsø, Mar 14 '18 at 13:38

score 0 · Accepted Answer · answered Mar 14 '18 at 13:49

Regex is not the way to go.

Please use an HTML parser (for example DOMParser):

const parser = new DOMParser();
const htmlDoc = parser.parseFromString(text, "text/html");
...

See also this famous SO answer... :-)
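One way the parsed document could then be queried is sketched below. This is only an illustration, not part of the original answer: the helper names activeTicketTitles and activeTicketTitlesFromHtml are hypothetical, and the selectors assume that each ticket is one tr inside the table with id customer-tickets and that ticket links start with /tickets/, as in the question's sample. The whole response text should be parsed rather than a cut-out fragment, since an HTML parser tends to discard stray table tags that have no enclosing table.

```javascript
// Hypothetical helper: returns the link text of every row whose status span is active.
// Assumes one ticket per <tr> inside table#customer-tickets (taken from the question's markup).
function activeTicketTitles(doc) {
    const titles = [];
    for (const row of doc.querySelectorAll('#customer-tickets tr')) {
        // Skip rows without an active status span (closed tickets, header rows, ...).
        if (!row.querySelector('span.status_active')) continue;
        const link = row.querySelector('a[href^="/tickets/"]');
        if (link) titles.push(link.textContent.trim());
    }
    return titles;
}

// Browser-only wrapper: DOMParser is a browser API and is not available in plain Node.
function activeTicketTitlesFromHtml(text) {
    const doc = new DOMParser().parseFromString(text, 'text/html');
    return activeTicketTitles(doc);
}
```

Inside the second then callback from the question, activeTicketTitlesFromHtml(text) would take the place of the regex match.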

MarcoS

I do not know why someone would down vote this. It is exactly what I am looking for. – Michael Krøyserth-Simsø Mar 14 '18 at 14:12

Paul · Answer 2 · score 0 · 2018-03-14T14:05:12.367

Try this one:

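var regex = /status_active.*?\n*.*<a.*?>(.*?)<\/a>/gm;
// match() with the g flag returns only whole matches, so read capture group 1 via matchAll()
var matches = Array.from(text.matchAll(regex), function (m) { return m[1]; });
console.log(matches);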

Another approach could be to use jQuery to parse the text and then use selectors to find the corresponding nodes. As MarcoS already stated, this would be a much cleaner solution, since regexes are not the best tool for parsing XML-like structures such as HTML.
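As a quick check of the capture-group idea, the sketch below runs the pattern from this answer against two rows shortened from the question's sample. The sample string and the use of matchAll (ES2020) are assumptions added for illustration. Note that match with the g flag would return the whole matched substrings rather than the captured link text, and that the pattern relies on the link sitting on the line right after the status span.

```javascript
// Two rows shortened from the question's HTML: one active ticket, one closed.
const sample = [
    '<span class="status status_active">active</span></td>',
    '<td><a href="/tickets/365347-SOME-TITLE">#365347 SOME-TITLE</a></td>',
    '</tr>',
    '<span class="status status_closed">closed</span></td>',
    '<td><a href="/tickets/150640-vs-sender">#150640 VS: SOME TITLE</a></td>',
    '</tr>'
].join('\n');

// Pattern from the answer; capture group 1 holds the text inside the <a> element.
const regex = /status_active.*?\n*.*<a.*?>(.*?)<\/a>/gm;

// matchAll yields one result per match, with the captured text at index 1.
const titles = Array.from(sample.matchAll(regex), (m) => m[1]);
console.log(titles); // one entry: '#365347 SOME-TITLE'
```

The closed ticket is skipped because its span carries status_closed, so the literal status_active never matches there.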

edited Mar 14 '18 at 14:05

answered Mar 14 '18 at 13:59

Paul
