C# Regex - How to parse string for Swedish letters åäöÅÄÖ?

Question

I'm trying to parse an HTML file for strings in this format:

<a href="/userinfo/userinfo.aspx?ID=305157" target="main">MyUsername</a> O22</td>

I want to extract "305157", "MyUsername", and the first letter of "O22" (which can be T, K, or O).

I'm using this regex: <a href="/userinfo/userinfo\.aspx\?ID=\d*" target="helgonmain">\w*</a> \w\d\d and it works fine as long as there are no åäöÅÄÖ characters where the "\w" is.

What should I do?

I am truly sorry, but I really need to post this link here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Fredrik Mörk, Nov 23 '09 at 21:40
I wanted to post it but figured I'd try to help him instead of showing him how futile it is to try ;)... — Wookai, Nov 23 '09 at 21:42
Yes, posting an actually helpful answer would have been better. I haven't written much HTML parsing code, though (but I would perhaps suggest looking into Html Agility Pack, which seems to pop up as a good HTML parsing library every now and then: http://www.codeplex.com/htmlagilitypack) — Fredrik Mörk, Nov 23 '09 at 23:35
It ended up being much much easier to just parse the HTML than to use HTML Agility Pack, I had actually looked at Agility before trying Regex. — Zolomon, Nov 23 '09 at 23:59

score 7 · Answer 1 · answered Nov 23 '09 at 21:42

You can use a character class which specifically includes those things:

[\wåäöÅÄÖ]*

Or you can use the Unicode character class for letters:

\p{L}

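or a named Unicode block. Note that .NET spells block names with an Is prefix (\p{IsBasicLatin}, not \p{InBasicLatin}), and Basic Latin covers only ASCII; åäöÅÄÖ are in the Latin-1 Supplement block:

\p{IsLatin-1Supplement}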
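
As a rough illustration of the character-class approach, here is a minimal C# sketch that swaps \w* for a class built on \p{L} and adds named groups for the three values the question asks for. The class name, method name, group names, and sample username are hypothetical. The sample link is assumed to carry target="helgonmain" so that it matches the pattern from the question, and \d+ is used in place of \d* so that an empty ID is not accepted.

```csharp
using System;
using System.Text.RegularExpressions;

static class SwedishLetterDemo
{
    // Hypothetical helper: the question's pattern with \w* replaced by
    // [\p{L}\p{Nd}_]* (any Unicode letter, decimal digit, or underscore)
    // and named groups added for the ID, the username, and the T/K/O letter.
    public static Match MatchUser(string html)
    {
        const string pattern =
            @"<a href=""/userinfo/userinfo\.aspx\?ID=(?<id>\d+)"" target=""helgonmain"">" +
            @"(?<name>[\p{L}\p{Nd}_]*)</a> (?<type>[TKO])\d\d";
        return Regex.Match(html, pattern);
    }

    static void Main()
    {
        // Assumed sample input, modelled on the line in the question.
        string html =
            "<a href=\"/userinfo/userinfo.aspx?ID=305157\" target=\"helgonmain\">Åsa_Öberg</a> O22</td>";

        Match m = MatchUser(html);
        if (m.Success)
        {
            Console.WriteLine(m.Groups["id"].Value);   // the numeric ID
            Console.WriteLine(m.Groups["name"].Value); // the username
            Console.WriteLine(m.Groups["type"].Value); // T, K, or O
        }
    }
}
```

In .NET, \w is itself Unicode-aware unless RegexOptions.ECMAScript is set, so if \w still misses these letters it may be worth checking how the HTML file is being decoded.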

Joey


score 4 · Answer 2 · edited May 23 '17 at 12:23

You can use \p{L} to match any 'letter', which will support all letters in all languages, as suggested in this SO question.

Or, you can simply replace \w* with [^<]*, to match all characters that are not the opening of an HTML tag.

But as said by others, parsing HTML using regex is a first step towards insanity...

answered Nov 23 '09 at 21:41

Wookai


score 3 · Accepted Answer · answered Nov 23 '09 at 21:42

Firstly: DON'T USE REGULAR EXPRESSIONS TO PARSE HTML. USE AN HTML PARSER.

Secondly: if you really want to do this (and you don't) then instead of \w you could match any character apart from '<':

<a href="/userinfo/userinfo\.aspx\?ID=\d*" target="helgonmain">[^<]*</a> \w\d\d
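
If the regex route is taken anyway, the [^<]* variant can be sketched in C# as below. This is only an illustration: the class name, method name, group names, and sample rows are hypothetical, and the sample links are assumed to carry target="helgonmain" as in the pattern. Because [^<]* accepts anything up to the next tag, it also covers usernames containing spaces or punctuation.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UserLinkExtractor
{
    // The pattern from this answer, with named groups added (an assumption)
    // so the ID, username, and leading T/K/O letter can be read back.
    static readonly Regex UserLink = new Regex(
        @"<a href=""/userinfo/userinfo\.aspx\?ID=(?<id>\d+)"" target=""helgonmain"">" +
        @"(?<name>[^<]*)</a> (?<type>[TKO])\d\d");

    // Hypothetical helper: returns one tuple per matching link in the input.
    public static List<(string Id, string Name, char Type)> Extract(string html)
    {
        var results = new List<(string Id, string Name, char Type)>();
        foreach (Match m in UserLink.Matches(html))
        {
            results.Add((m.Groups["id"].Value,
                         m.Groups["name"].Value,
                         m.Groups["type"].Value[0]));
        }
        return results;
    }

    static void Main()
    {
        // Assumed sample input with two rows, one containing Swedish letters.
        string html =
            "<td><a href=\"/userinfo/userinfo.aspx?ID=305157\" target=\"helgonmain\">MyUsername</a> O22</td>" +
            "<td><a href=\"/userinfo/userinfo.aspx?ID=42\" target=\"helgonmain\">Björn Åkesson</a> K31</td>";

        foreach (var user in Extract(html))
        {
            Console.WriteLine($"{user.Id}: {user.Name} ({user.Type})");
        }
    }
}
```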
