Find Text in HTML with Regex

Question

I have some text

Trotzdem gibt es Untersuchungen, die nahelegen, dass bis zu 20% der Studierenden in Deutschland während der Prüfungsvorbereitung Ritalin einschmeissen [2], Reportagen, dass britische Studierende Modafinil bestens kennen[3] und Studierende weltweit auch nach der Silk Road — einem mittlerweile eingestellten Schwarzmarkt im Deep Web – mit illegalen „Nootropics“ experimentieren.

and I have some HTML

<p>Die <span class="caps">GDS</span> zeichnet also das Bild einer Gesellschaft, in der Drogen primär Rausch, Genuss und Spass sind. Tabak ist zwar das bekannteste – und ungesündeste – Mittel gegen Stress, aber sonst sind die Leistungssteigerer in der Liste weit abgeschlagen. Trotzdem gibt es Untersuchungen, die nahelegen, dass bis zu 20% der Studierenden in Deutschland während der Prüfungsvorbereitung Ritalin einschmeissen <a href="#_ftn2" name="_ftnref2">[2]</a>, Reportagen, dass britische Studierende Modafinil bestens kennen<a href="#_ftn3" name="_ftnref3">[3]</a> und Studierende weltweit auch nach der <a href="https://de.wikipedia.org/wiki/Silk_Road" target="_blank">Silk Road</a> — einem mittlerweile eingestellten Schwarzmarkt im Deep Web – mit illegalen „Nootropics“ experimentieren.</p>

To find the text in the HTML, I build a fairly convoluted regex: I split the text on spaces and join the pieces again with

\s*?(?:<\/?[^>]*?>)?\s*?

That works most of the time as seen here: https://regex101.com/r/hG9lT9/1
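For readers who want to reproduce the idea, here is a minimal Python sketch of the split-and-join approach described above. The function name `build_pattern` and the use of `re.escape` on each word are assumptions for illustration; the question does not say which language is used or whether the words are escaped.

```python
import re

# Separator from the question: optional whitespace, at most one tag, optional whitespace.
SEPARATOR = r"\s*?(?:<\/?[^>]*?>)?\s*?"


def build_pattern(text):
    """Split the text on spaces and join the escaped words with SEPARATOR."""
    words = text.split(" ")
    # re.escape is an assumption: it keeps characters such as "[" literal.
    return re.compile(SEPARATOR.join(re.escape(word) for word in words))


html = '<p>Die <span class="caps">GDS</span> zeichnet also das Bild</p>'
found = build_pattern("Die GDS zeichnet also").search(html)

# A tag between "]" and "," falls inside a single word, so no separator covers it.
html_fail = 'einschmeissen <a href="#_ftn2" name="_ftnref2">[2]</a>, Reportagen'
missed = build_pattern("einschmeissen [2], Reportagen").search(html_fail)
```

In this sketch `found` is a match object, while `missed` is `None`, since the separator is only allowed where the plain text has a space.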

For the case at the top it doesn't work, because a comma follows directly after an HTML tag and the passage also contains different kinds of dashes. So I'm looking for a more general regex that covers such cases.

Here is the example that doesn't work: https://regex101.com/r/hG9lT9/2
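One possible generalisation, sketched here in Python under stated assumptions, is to allow tags between every pair of characters rather than only between words, and to treat hyphen, en dash and em dash as interchangeable. The names `TAG`, `GAP`, `DASH_CLASS`, `build_pattern` and `find_in_html` are hypothetical, and the dash equivalence is an assumption about what "different dashes" should mean. It is still a regex heuristic, not an HTML parser.

```python
import re

TAG = r"(?:<[^>]*>)*"                # zero or more tags between two characters
GAP = r"(?:\s|<[^>]*>)+"             # whitespace and/or tags between two words
DASH_CLASS = "[\\-\u2013\u2014]"     # assumption: hyphen, en dash, em dash are equivalent


def char_pattern(ch):
    """Return a regex fragment that matches one character of the search text."""
    if ch in "-\u2013\u2014":
        return DASH_CLASS
    return re.escape(ch)


def build_pattern(text):
    """Build a regex that tolerates tags inside words and between words."""
    words = text.split()
    word_patterns = [TAG.join(char_pattern(c) for c in word) for word in words]
    return re.compile(GAP.join(word_patterns))


def find_in_html(text, html):
    """Return the (start, end) span of the text inside the HTML, or None."""
    if not text.split():
        return None  # nothing to search for
    match = build_pattern(text).search(html)
    return match.span() if match else None


html = 'einschmeissen <a href="#_ftn2" name="_ftnref2">[2]</a>, Reportagen'
span = find_in_html("einschmeissen [2], Reportagen", html)
```

Here `span` covers the whole fragment, because the closing tag between `]` and `,` is absorbed by `TAG`. The pattern grows with the length of the text, so for very long passages extracting the plain text first may be the simpler route.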

Rule 1: don't use RegEx to parse HTML. Rule 2: if you still want to parse HTML with RegEx, see rule 1 — freefaller, Sep 09 '15 at 08:58
You can get the text from a given HTML string with JavaScript http://stackoverflow.com/questions/822452/strip-html-from-text-javascript — Tasos K., Sep 09 '15 at 09:01
[Obligatory link](http://stackoverflow.com/a/1732454/3478852) — Nisse Engström, Sep 13 '15 at 06:22
I don't actually need to parse html. I need to parse text with unknown characters in between words. — thgie, Sep 14 '15 at 06:45

score -4 · Answer 1 · answered Sep 09 '15 at 09:06

Split by: <[^>]*>? (a regex for the HTML tags)

answered Sep 09 '15 at 09:06

vrachlin

See the comments above, and specifically @tasos-k's comment for how to do it legitimately. – Wil Sep 09 '15 at 09:10
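The route the comments point to, extracting the plain text first and then searching it, could look roughly like this. The linked question is about JavaScript; this sketch uses Python's standard `html.parser` module instead, and the names `TextExtractor`, `html_to_text` and `contains_text` are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of the HTML with all tags dropped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def contains_text(text, html):
    """Check whether the text occurs in the HTML, ignoring tags and whitespace runs."""
    def normalize(s):
        return " ".join(s.split())
    return normalize(text) in normalize(html_to_text(html))


html = 'kennen<a href="#_ftn3" name="_ftnref3">[3]</a> und Studierende'
plain = html_to_text(html)
```

This only answers whether the text occurs; mapping the hit back to offsets in the original HTML would need extra bookkeeping that is not shown here.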
