Regex replacing all ASCII character codes with actual characters

Question

I have a string that looks like:

4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.

I am trying to use a regex to search through the string, find all character codes, and replace each one with the character it represents. For my sample string, I want to end up with a string that looks like

4000 BCE–5000 BCE and 600 CE–650 CE.

I tried writing it in code, but I can't figure out what to write:

string line = "4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";

listof?datatype matches = search through `line` and find all the matches to  "&#.*?;"

foreach (?datatype match in matches){
    int extractedNumber = Convert.ToInt32(Regex.(/*extract the number that is between the &# and the ?*/));

    //convert the number to ascii character
    string actualCharacter = (char) extractedNumber + "";

    //replace character code in original line
    line = Regex.Replace(line, match, actualCharacter); 
}

Edit

My original string actually has some HTML in it and looks like:

4000 <small>BCE</small>&#8211;5000 <small>BCE</small> and 600 <small>CE</small>&#8211;650 <small>CE</small>

I used line = Regex.Replace(note, "<.*?>", string.Empty); to remove the <small> tags, but apparently, according to one of the most popular questions on SO, "RegEx match open tags except XHTML self-contained tags", you really should not use regex to remove HTML.

No need for Regex. Just use `System.Net.WebUtility.HtmlDecode` or `System.Web.HttpUtility.HtmlDecode` — EZI, Jul 10 '15 at 20:08
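
A minimal sketch of what the comment above suggests, assuming the string only needs its character references decoded and has no markup left to strip. The names HtmlDecodeDemo and Decode are made up for illustration; WebUtility.HtmlDecode is the framework method the comment refers to.

```csharp
using System;
using System.Net;

static class HtmlDecodeDemo
{
    // Decodes numeric (&#8211;, &#x2013;) and named (&amp;) references in one call.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }

    static void Main()
    {
        string line = "4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";
        string decoded = Decode(line);
        // Each &#8211; should now be U+2013 (en dash).
        Console.WriteLine(decoded == "4000 BCE\u20135000 BCE and 600 CE\u2013650 CE");
    }
}
```

System.Web.HttpUtility.HtmlDecode is called the same way, but it needs a reference to System.Web, so WebUtility is probably the lighter choice outside ASP.NET.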

score 2 · Accepted Answer · 2015-07-13T16:47:32.490

How about doing it with a delegate replacement?
edit: As a side note, this is a good regex to remove all tags and script blocks:

<(?:script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)?\s*>[\S\s]*?</script\s*|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>

C#:

string line = @"4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";
Regex RxCode = new Regex(@"&#([0-9]+);");
string lineNew = RxCode.Replace(
    line,
    delegate( Match match ) {
        return "" + (char)Convert.ToInt32( match.Groups[1].Value);
    }
);
Console.WriteLine( lineNew );

Output:

4000 BCE-5000 BCE and 600 CE-650 CE
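
One caveat about the (char) cast: a char holds a single UTF-16 code unit, so a reference above 65535, such as &#128512;, does not survive the cast. The sketch below is a variation, not the answer's code. It swaps in char.ConvertFromUtf32 and leaves references that are not valid code points untouched. NumericReferenceDecoder and Decode are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class NumericReferenceDecoder
{
    static readonly Regex RxCode = new Regex(@"&#([0-9]+);");

    public static string Decode(string line)
    {
        return RxCode.Replace(line, match =>
        {
            int codePoint;
            // Accept only valid Unicode scalar values (no surrogates, max U+10FFFF).
            bool valid = int.TryParse(match.Groups[1].Value, out codePoint)
                && codePoint <= 0x10FFFF
                && (codePoint < 0xD800 || codePoint > 0xDFFF);
            // ConvertFromUtf32 returns a surrogate pair for code points above U+FFFF.
            return valid ? char.ConvertFromUtf32(codePoint) : match.Value;
        });
    }

    static void Main()
    {
        Console.WriteLine(Decode("4000 BCE&#8211;5000 BCE") == "4000 BCE\u20135000 BCE");
        // A code point above U+FFFF becomes two UTF-16 code units.
        Console.WriteLine(Decode("&#128512;").Length);
    }
}
```

For the sample string in the question the result is the same as with the cast; the difference only shows up for supplementary characters and malformed references.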

edit: If you expect the hex form as well, you can handle that too.

 #  @"&\#(?:([0-9]+)|x([0-9a-fA-F]+));"

 &\#
 (?:
      ( [0-9]+ )                    # (1)
   |  x
      ( [0-9a-fA-F]+ )              # (2)
 )
 ;

C#:

Regex RxCode = new Regex(@"&#(?:([0-9]+)|x([0-9a-fA-F]+));");
string lineNew = RxCode.Replace(
    line,
    delegate( Match match ) {
        return match.Groups[1].Success ? 
            "" + (char)Convert.ToInt32( match.Groups[1].Value ) :
            "" + (char)Int32.Parse( match.Groups[2].Value, System.Globalization.NumberStyles.HexNumber);
    }
);

What is Groups[1] and Groups[2]? Can you please explain that part of the code? — Tot Zam, Jul 10 '15 at 22:55
@TotZam - They are the capture groups in the regex (see the formatted regex, marked as (1) and (2) ). — , Jul 11 '15 at 17:59
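
To make the Groups question concrete, here is a small sketch using the same pattern as above. Groups[0] is the whole match, Groups[1] is the first parenthesized part (the decimal digits), and Groups[2] is the second (the hex digits). Only one of the two succeeds for any given match. GroupsDemo and Describe are illustrative names, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupsDemo
{
    static readonly Regex RxCode = new Regex(@"&#(?:([0-9]+)|x([0-9a-fA-F]+));");

    // Reports which capture group matched for the first reference in the input.
    public static string Describe(string input)
    {
        Match m = RxCode.Match(input);
        if (!m.Success) return "no match";
        // The (?: ... ) wrapper is non-capturing, so it does not get a group number.
        return m.Groups[1].Success
            ? "decimal:" + m.Groups[1].Value
            : "hex:" + m.Groups[2].Value;
    }

    static void Main()
    {
        Console.WriteLine(Describe("BCE&#8211;5000"));   // decimal:8211
        Console.WriteLine(Describe("BCE&#x2013;5000"));  // hex:2013
    }
}
```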

Wiktor Stribiżew · Answer 2 · 2015-07-13T14:03:38.313

You do not need any regex to convert XML entity references to literal strings.

Solution 1: XML-valid input

Here is a solution that assumes you have an XML-valid input.

Add a using System.Xml; directive and use this method:

public string XmlUnescape(string escaped)
{
    XmlDocument doc = new XmlDocument();
    XmlNode node = doc.CreateElement("root");
    node.InnerXml = escaped;
    return node.InnerText;
}

Use it like this:

var output1 = XmlUnescape("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.");

Solution 2: Non-valid XML input with HTML/XML entities

In case you cannot use the XmlDocument with your strings since they contain invalid XML syntax, you can use the following method that uses HttpUtility.HtmlDecode to convert only the entities that are known HTML and XML entities:

public string RevertEntities(string test)
{
   Regex rxHttpEntity = new Regex(@"(&[#\w]+;)"); // Declare a regex (better to initialize it once as a static field for performance)
   string last_res = string.Empty; // A temporary variable holding the result of the previous pass
   while (rxHttpEntity.IsMatch(test)) // While our input has something like &#101; or &nbsp;
   {
       test = test.Replace(rxHttpEntity.Match(test).Value, HttpUtility.HtmlDecode(rxHttpEntity.Match(test).Value.ToLower())); // Replace the first entity reference found with its literal value (&amp; => &)
       if (last_res == test) // Check if we made any change to the string
           break; // If not, stop processing (there are some unsupported entities like &ourgreatcompany;)
       else
           last_res = test; // Else, go on checking for entities
   }
   return test;
}

Calling this as below:

var output2 = RevertEntities("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.");
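
Two things to watch for with the loop above: it stops as soon as the first match is an unsupported entity, even if decodable references follow, and it decodes repeatedly, so &amp;#8211; ends up as a dash rather than as &#8211;. A possible single-pass variation is sketched below, under the assumption that one level of decoding is what you want. It uses WebUtility.HtmlDecode instead of HttpUtility to avoid the System.Web reference and drops the ToLower() call, since named entities are case-sensitive. EntityReverter and RevertEntitiesSinglePass are hypothetical names.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class EntityReverter
{
    // Same pattern as RevertEntities, created once and reused.
    static readonly Regex RxEntity = new Regex(@"&[#\w]+;");

    public static string RevertEntitiesSinglePass(string text)
    {
        // Each match is decoded exactly once; unknown entities come back unchanged.
        return RxEntity.Replace(text, m => WebUtility.HtmlDecode(m.Value));
    }

    static void Main()
    {
        // An unsupported entity no longer stops later references from being decoded.
        Console.WriteLine(RevertEntitiesSinglePass("&ourgreatcompany; 600 CE&#8211;650 CE")
            == "&ourgreatcompany; 600 CE\u2013650 CE");
        // Already-escaped text is decoded one level only.
        Console.WriteLine(RevertEntitiesSinglePass("&amp;#8211;") == "&#8211;");
    }
}
```

Calling WebUtility.HtmlDecode on the whole string should give the same result for these inputs; the regex only matters if you want to restrict which tokens are decoded.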

Solution 3: HtmlAgilityPack and HtmlEntity.DeEntitize

Install HtmlAgilityPack via Manage NuGet Packages for Solution and use this code to get all the text:

public string getCleanHtml(string html)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}

And then use

var txt = "4000 <small>BCE</small>&#8211;5000 <small>BCE</small> and 600 <small>CE</small>&#8211;650 <small>CE</small>";
var clean = getCleanHtml(txt);

doc.DocumentNode.InnerText.Substring(doc.DocumentNode.InnerText.IndexOf("\n")).Trim();

You can use LINQ with HtmlAgilityPack, download pages (with var webGet = new HtmlAgilityPack.HtmlWeb(); var doc = webGet.Load(url);), and a lot more. And the best part is that there will be no entities to handle manually.

You're making the assumption that his text is valid XML just because it has XML entities in it. Since we know nothing about the source of the input, that's not necessarily a safe assumption to make. What if it has < or > in it? Your code won't work. I wouldn't recommend this solution unless the OP provides clarification that the input string would always be valid to convert to XML. For example, what if this is actually HTML (which uses the entities too) and so it has things like <br> with no close tag, which is illegal in XML. — dmeglio, Jul 10 '15 at 19:19
My original strings actually can have HTML. I have another Regex statement that removes the HTML, but @dman2306 is correct that you can't guarantee that all my text is valid XML. Since there always is a possibility that my HTML remover malfunctioned, I would feel safer using your second answer. — Tot Zam, Jul 10 '15 at 21:21
@TotZam then your starting point is not correct. Instead of Regex, use a real html parser like [HtmlAgilityPack](https://htmlagilitypack.codeplex.com/). Also read: http://stackoverflow.com/a/1732454/932418 Thanks for great example of [XY Problem](http://www.perlmonks.org/?node=XY+Problem) — EZI, Jul 10 '15 at 21:32
@TotZam: I agree that HtmlAgilityPack is very easy and eliminates lots of issues in case you need to parse HTML. Let me know if you need a helping hand with it. Also, my solution also deals with HTML, not only XML entities (as the other suggested solution). — Wiktor Stribiżew, Jul 10 '15 at 21:36
I tried both your answer as well as @sln answer. I really don't understand any completely so I don't know which to use. Can you explain the differences? Which one is more efficient? Which method do you think is better? — Tot Zam, Jul 13 '15 at 13:28
I am afraid you need to re-vamp your solution. You get your input from HTML, so you need to use an HTML Parser. It will eliminate this task. Please describe how your program works, and I will explain how you can simplify your code that will most probably relieve you from choosing between lesser evil. **I added comments to my code**. Please check my Solution #3. I can help more if you choose HtmlAgilityPack. — Wiktor Stribiżew, Jul 13 '15 at 13:50
I updated the answer. Looks like `HtmlAgilityPack` is your best friend here :) — Wiktor Stribiżew, Jul 13 '15 at 14:03
