9

HTML code example:

<meta http-equiv="Content-type" content="text/html;charset=utf-8" />

I want to use a regex to extract the charset value (in this example, "utf-8").

(I'm using C#)

silent
  • What language are you using? They all have subtle (and not so subtle) differences in their regex dialects. – Oded Aug 11 '10 at 12:36
  • `Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.` ~ Jamie Zawinski – Alex Larzelere Aug 11 '10 at 12:43
  • If your example HTML is the entirety of the string you have to parse, then regex is OK, but if your string is an entire HTML document, "You's" answer applies. – Benjol Aug 11 '10 at 13:01
  • `And some people, when confronted with regular expressions, think "I know, I'll use a catchy quote that I remember". Now they have added nothing to the discussion.` ~ Tomalak – Bart Kiers Aug 11 '10 at 13:05

9 Answers

17

My answer provides a more robust version of @Floyd's expression and, as far as possible, handles @You's breakage test case by using a negative lookahead. There is only one relevant case I can think of (a variant of @You's example) where it gives a false positive, and I think it would be pretty rare. The expressions are meant to be run with the case-insensitive flag and were tested using java.util.regex and JRegex.

Capture groups are automatically trimmed and never include quotes or other tag characters such as "/" or ">". The second expression has two capture groups: the first is the content-type value, which may be empty (e.g., when the charset attribute is used), and the second is the charset value, which is always non-empty (unless the charset value is literally left empty for some odd reason).

Regex for matching/grouping charset value only - trimmed, skips quotes

<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([^\s"'/>]*)

Same as above, but also matches/groups the content-type (optional) and charset (required) values, trimmed, skipping quotes. Minor caveat: it does not match a standalone content-type value, i.e., "text/html" on its own.

<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s"']*)?([^>]*?)[\s"';]*charset\s*=[\s"']*([^\s"'/>]*)
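Since the question is about C#, here is a minimal sketch of how the two expressions above might be applied with System.Text.RegularExpressions. It assumes .NET accepts both patterns unchanged (the answer itself reports testing only with java.util.regex and JRegex). The names MetaCharset, ExtractCharset and TryExtractTypeAndCharset are illustrative and not part of the original answer.

```csharp
using System.Text.RegularExpressions;

public static class MetaCharset
{
    // First expression above: group 1 is the charset value.
    private static readonly Regex CharsetOnly = new Regex(
        @"<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s""']*([^\s""'/>]*)",
        RegexOptions.IgnoreCase);

    // Second expression above: group 1 is the content type (possibly empty),
    // group 2 is the charset value.
    private static readonly Regex TypeAndCharset = new Regex(
        @"<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s""']*)?([^>]*?)[\s""';]*charset\s*=[\s""']*([^\s""'/>]*)",
        RegexOptions.IgnoreCase);

    // Returns the charset, or null if no matching meta tag is found.
    public static string ExtractCharset(string html)
    {
        Match m = CharsetOnly.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns false (and null outputs) if no matching meta tag is found.
    public static bool TryExtractTypeAndCharset(string html, out string contentType, out string charset)
    {
        Match m = TypeAndCharset.Match(html);
        contentType = m.Success ? m.Groups[1].Value : null;
        charset = m.Success ? m.Groups[2].Value : null;
        return m.Success;
    }
}
```

With the meta tag from the question, ExtractCharset should return utf-8, and TryExtractTypeAndCharset should additionally report text/html as the content type.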

Test cases (all pass except the very last one)...

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" />
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'/>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' />
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1/>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 />
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" >
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' >
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 >

<meta http-equiv="Content-Type" content="text/html;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";charset="iso-8859-1"'>

<meta http-equiv="Content-Type" content="text/html;;;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;;;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;;;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';;;charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;;;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;;;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;;;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";;;charset="iso-8859-1"'>

<meta  http-equiv  =  "  Content-Type  "  content  =  "  '  text/html  '  ;  ;;  '  ;  '  '  ;  '  ;  ' ;;  ;  charset  =  '  iso-8859-1  '  "  >
<meta  content  =  "  '  text/html  '  ;  ;;  '  ;  '  '  ;  '  ;  ' ;;  ;  charset  =  '  iso-8859-1  '  "  http-equiv  =  "  Content-Type  "  >
<meta  http-equiv  =  Content-Type  content  =  text/html;charset=iso-8859-1  >
<meta  content  =  text/html;charset=iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;;;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;;;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;  ;  ;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;  ;  ;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >

<meta charset="utf-8"/>
<meta charset="utf-8" />
<meta charset='utf-8'/>
<meta charset='utf-8' />
<meta charset=utf-8/>
<meta charset=utf-8 />
<meta charset="utf-8">
<meta charset="utf-8" >
<meta charset='utf-8'>
<meta charset='utf-8' >
<meta charset=utf-8>
<meta charset=utf-8 >

<meta  charset  =  "  utf-8  "  >
<meta  charset  =  '  utf-8  '  >
<meta  charset  =  "  utf-8  '  >
<meta  charset  =  '  utf-8  "  >
<meta  charset  =  "  utf-8     >
<meta  charset  =  '  utf-8     >
<meta  charset  =     utf-8  '  >
<meta  charset  =     utf-8  "  >
<meta  charset  =     utf-8     >
<meta  charset  =     utf-8    />

<meta name="title" value="charset=utf-8 — is it really useful (yep)?">
<meta value="charset=utf-8 — is it really useful (yep)?" name="title">
<meta name="title" content="charset=utf-8 — is it really useful (yep)?">
<meta name="charset=utf-8" content="charset=utf-8 — is it really useful (yep)?">

<meta content="charset=utf-8 — is it really useful (nope, not here, but gotta admit pretty robust otherwise)?" name="title">
sisu
9

This regex:

<meta.*?charset=([^"']+)

Should work. Using an XML parser to extract this is overkill.

NullUserException
  • Hm... ``. Give me a HTML-parsing regex, and I shall break it. – You Aug 11 '10 at 13:19
  • @You This is a contrived non-example that would almost never occur in real world usage. – NullUserException Aug 11 '10 at 13:24
  • I am happy with my regex working 99.9% of the time. By the way, you can't always use an XML parser because real world markup is rarely well behaved. – NullUserException Aug 11 '10 at 13:33
  • +1, although I would make the .* a non-capturing group, so as a string literal in C# it would be "\\ – John M Gant Aug 11 '10 at 13:41
  • If you're handling XHTML, it *should* be valid XML. Otherwise it's not XHTML. In the case of HTML, an SGML parser will be able to parse it, in as many cases as this regex will work. If not more. – You Aug 11 '10 at 15:11
0

This regular expression captures the charset value itself from any meta tag (run it with the case-insensitive option, as in the code below):

(?<=<meta(.*)charset=)([^"'>]*)

Example input:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta http-equiv=Content-Type content=text/html; charset=windows-1252>
<meta http-equiv=Content-Type content='text/html; charset=windows-1252'>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" /> 
<meta http-equiv="Content-type" content="text/html;charset=iso-8859-1" /> 

Use it like this:

Regex regexObj = new Regex("(?<=<meta(.*)charset=)([^\"'>]*)", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    for (int i = 1; i < matchResults.Groups.Count; i++) {
        Group groupObj = matchResults.Groups[i];
        if (groupObj.Success) {
            // matched text: groupObj.Value
            // match start: groupObj.Index
            // match length: groupObj.Length
        } 
    }
    matchResults = matchResults.NextMatch();
} 

Will find these values:

windows-1252

windows-1252

windows-1252

utf-8

iso-8859-1

krisdyson
0

Also try:

<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([a-zA-Z0-9-]+)[\s"'\/]*>
Stephan
Mikhail Gerasimov
0

I tried this in JavaScript, placing your string in a variable and calling match:

var x = '<meta http-equiv="Content-type" content="text/html;charset=utf-8" />';
var result = x.match(/charset=([a-zA-Z0-9-]+)/);
alert(result[1]);
Zsolti
0

For PHP:

// preg_match returns the number of matches; the captures go into the third argument
preg_match('/charset=([a-zA-Z0-9-]+)/', $line, $matches);
$charset = $matches[1];
Delan Azabani
  • -1, using regexps is not a good idea. See my comment on the answer by @Zsolti. – You Aug 11 '10 at 12:56
0

I tend to agree with @You; however, I'll give you the answer you requested plus some other solutions.

String meta = "<meta http-equiv=\"Content-type\" content=\"text/html;charset=utf-8\" />";

// Option 1: regex replace (returns the input unchanged if the pattern does not match)
String charSet = System.Text.RegularExpressions.Regex.Replace(meta, "<meta.*charset=([^\\s'\"]+).*", "$1");

// Option 2: if the meta tag's attributes are enclosed in double quotes
String charSetDoubleQuoted = meta.Split(new String[] { "charset=" }, StringSplitOptions.None)[1].Split('"')[0];
// Option 3: if the meta tag's attributes are enclosed in single quotes
String charSetSingleQuoted = meta.Split(new String[] { "charset=" }, StringSplitOptions.None)[1].Split('\'')[0];

Any of the above should work. However, the String.Split versions are fragile: if "charset=" is not in the string, Split returns a single-element array and indexing [1] throws an IndexOutOfRangeException. Check the array length first, which means breaking the one-liners above into separate steps.
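A minimal sketch of the Split approach with the array check broken out might look like the following. The class and method names are illustrative, and it assumes the attribute value is quoted and that "charset=" appears in lower case.

```csharp
using System;

public static class CharsetSplit
{
    // Returns the text between "charset=" and the next quote,
    // or null if "charset=" does not occur in the input.
    public static string ExtractCharset(string meta)
    {
        string[] parts = meta.Split(new[] { "charset=" }, StringSplitOptions.None);
        if (parts.Length < 2)
        {
            return null; // no "charset=" found, so there is no parts[1] to index
        }
        // Cut at the first double or single quote, whichever comes first.
        // Assumption: the value is quoted; an unquoted value would keep trailing characters.
        return parts[1].Split('"', '\'')[0];
    }
}
```

Returning null keeps the "not found" case explicit for the caller instead of surfacing it as an exception.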

Brian
0

My regex:

<meta[^>]*?charset=([^"'>]*)

My testcase:

<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<meta name="author" value="me"><!-- Maybe we should have a charset=something meta element? --><meta charset="utf-8">

C#-Code:

using System.Text.RegularExpressions;
string resultString = Regex.Match(sourceString, "<meta[^>]*?charset=([^\"'>]*)").Groups[1].Value;

RegEx-Description:

// <meta[^>]*?charset=([^"'>]*)
// 
// Match the characters "<meta" literally «<meta»
// Match any character that is not a ">" «[^>]*?»
//    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the characters "charset=" literally «charset=»
// Match the regular expression below and capture its match into backreference number 1 «([^"'>]*)»
//    Match a single character NOT present in the list ""'>" «[^"'>]*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
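To see how this expression behaves on each line of the test case separately, here is a small C# sketch (FloydRegexDemo and Capture are illustrative names, not part of the answer). One thing it brings out: the capture class excludes quotes and the pattern does not skip a quote after charset=, so for the quoted form <meta charset="utf-8"> the tag matches but the captured value is empty. LenientPattern is a hypothetical variant that allows one optional quote before the value; it is an assumption about how that case could be handled, not the author's expression.

```csharp
using System.Text.RegularExpressions;

public static class FloydRegexDemo
{
    // The expression from the answer, unchanged.
    public const string OriginalPattern = "<meta[^>]*?charset=([^\"'>]*)";

    // Hypothetical variant: skips one optional quote before the value.
    public const string LenientPattern = "<meta[^>]*?charset=[\"']?([^\"'>]*)";

    // Returns capture group 1 of the first match, or null when nothing matches.
    public static string Capture(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

In both patterns, [^>]*? cannot cross a tag boundary, so the charset=something text inside the comment on the second test line is never picked up.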
Floyd
-2

Don't use regular expressions to parse (X)HTML! Use a proper tool, i.e. an SGML or XML parser. Your code looks like XHTML, so I'd try an XML parser. Once you have the attribute from the meta element, however, a regex would be more appropriate, although a plain string split at ; would certainly do the trick (and faster, too).
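As a rough C# sketch of the approach described above, assuming the input is a single well-formed XHTML-style meta element (XElement.Parse throws an XmlException otherwise): read the content attribute with an XML parser, then split it at ";". The class and method names are illustrative.

```csharp
using System;
using System.Xml.Linq;

public static class MetaCharsetXml
{
    // Returns the charset from the element's content attribute,
    // or null if the attribute or a charset parameter is missing.
    public static string ExtractCharset(string metaElement)
    {
        // Assumption: metaElement is well-formed XML, e.g. a self-closed <meta ... />.
        XElement meta = XElement.Parse(metaElement);
        string content = (string)meta.Attribute("content");
        if (content == null)
        {
            return null;
        }
        // The attribute value looks like "text/html;charset=utf-8": split at ';'.
        foreach (string part in content.Split(';'))
        {
            string trimmed = part.Trim();
            if (trimmed.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                return trimmed.Substring("charset=".Length);
            }
        }
        return null;
    }
}
```

For a full, possibly malformed HTML document this sketch would not be enough; a lenient HTML or SGML parser would be needed instead of XElement.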

Community
You
  • He is not parsing a whole HTML document, just a single line. – Oded Aug 11 '10 at 12:41
  • I don't see that in the original question. – David Yell Aug 11 '10 at 12:43
  • Doesn't say that anywhere. And the "no regex" rule still applies, even to single lines; (X)HTML is not a regular grammar and can't be parsed using regular expressions. – You Aug 11 '10 at 12:44