
Writing a file utility to strip out all non-ASCII characters from files. I have this Regex:

Regex rgx = new Regex(@"[^\u0000-\u007F]");

Which works fine. But unfortunately, I've discovered some silly people use right angles (¬) as delimiters in their files, so these get stripped out as well, but I need those!

I'm pretty new to Regex, and I do understand the basics, but any help would be awesome!

Thanks in advance!

New Start
  • Because it is [¡⅁uoɹʍ puɐ ⅂IɅƎ](http://stackoverflow.com/questions/4174089/regular-expression-to-anglicize-string-characters/4174112#4174112), that’s why!!! – tchrist Nov 15 '10 at 11:23
  • @tchrist: Okay, I see your point, but I'm working with specific files that I know the content of and I know for certain that none of these would strip characters from other languages or anything like that. I think you should consider context before commenting! – New Start Nov 15 '10 at 11:29
  • The point is that ASCII is fifty years out of date; it’s from the **1960s** for goodness’ sake! If there are code points greater than 127 in your text, then they’re there for a reason and you should not blithely mutilate what someone else has gone to some trouble to produce. You never need to do this, and you never *should* do this. Please don’t **castrate** proper Unicode text back into the dinosaur days before you were even born. Welcome to the New Millennium: text is **not ASCII!!** *əɹnʇnɟ əɥʇ oʇuᴉ ƨpɹɐʍʞɔɐq ʞlɐʍ ʇou op :noλ ⅁əq I* – tchrist Nov 15 '10 at 11:34
  • Look, I'm a student on placement right now and I've been thrown this utility that I have to write that I have no idea how to do. Right now, I'm experimenting with things and trying to figure out what to do, as well as hopefully learning as I do it. I have no idea how this utility will turn out, or how I'm going to make it work, but right now, I'm just trying things. So, please, PLEASE back off and let me try things. Take your preaching somewhere else. – New Start Nov 15 '10 at 11:41
  • If your assignment is to convert Unicode to ASCII, then please state as much. What specifically is your assignment? Often students bark up the wrong tree. If you were told to destroy the Unicode, then fine, but if not, what are you really trying to do? – tchrist Nov 15 '10 at 11:53
  • Fine, if I'm barking up the wrong tree, let me, how am I supposed to learn? Bombarding me with condescending comments just makes me feel like a bit of dick and doesn't help me at all. My utility is to basically 'fix' files we've been provided with. Including removing special characters. We know what the files need to contain to be 'fixed' to our standards. My spec mentions ASCII characters, so I'm experimenting with what I know, as I said. Now, unless you can actually help me with this problem, what's the point in our conversation? – New Start Nov 15 '10 at 12:03
  • If your assignment is to remove any code point outside the ASCII range, that’s pretty easy. The most obvious approach is to complement the ASCII set. ASCII is `[\x00-\x7F]`, so its complement is `[^\x00-\x7F]`. Depending on which version of C# you’re using, you may be able to use charclass subtraction: e.g., `[\p{L}-[\p{IsBasicLatin}]]`, which is Letter characters above 127. There, does that help? – tchrist Nov 15 '10 at 12:13
  • Thank you, yes, but I had already done that myself, my problem was that I needed to include right angles (¬), as they are used as delimiters in some of the files. And before anything gets confused, there are 4 set delimiters in our files - ',' '|' ';' '¬' - and as you'll know, the only one not in the set is ¬ - so my Regex needs to sort that. So, I tried all the answers and they still don't work. Ideas? – New Start Nov 15 '10 at 12:17
  • The set of all characters that are neither ASCII nor U+00AC NOT SIGN is `[^\x00-\x7F\xAC]`, or `[^\u00AC\u0000-\u007F]` if the first gets you into trouble. However, **this will not cure your "?" problem**, because the problem is not that you have non-ASCII (remember a NOT SIGN is non-ASCII, too). It is that you have an encoding specified incorrectly somewhere. Does that make sense? This is why sometimes just answering the question will not address the real problem: the questioner has misunderstood what the real problem is, and therefore overspecifies a solution to what is not his problem. – tchrist Nov 15 '10 at 12:21
  • Erm, yes, I think so.. But I haven't specified encoding anywhere in my program, or if that's not the case, I would have no idea how to change it. But I would have thought that when I applied the Regex, it would show the right angles in the output, even if it was just "????", because I know their placement in the file. Do you know what I mean? – New Start Nov 15 '10 at 12:28
  • Thank you for your help, my file was saved with ANSI encoding instead of Unicode, and that was causing all the hassle with the output and the Regex. And I'd like to think I didn't misunderstand my question, it's just from asking the question and getting the solution, another problem was discovered. I do thank you for your help, I'm grateful, but as I said before, being condescending never helps. – New Start Nov 15 '10 at 12:44
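
For reference, the two patterns mentioned in the comments above (the plain complement and the character class subtraction) can be compared side by side. This is only a minimal sketch; the variable names are made up for illustration, and it assumes a .NET version with top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Complement of the ASCII range: matches any character above U+007F.
var nonAscii = new Regex(@"[^\x00-\x7F]");

// Character class subtraction: letters (\p{L}) minus the Basic Latin block.
var nonAsciiLetter = new Regex(@"[\p{L}-[\p{IsBasicLatin}]]");

// U+00AC NOT SIGN is outside ASCII, so the complement matches it...
Console.WriteLine(nonAscii.IsMatch("\u00AC"));
// ...but it is a symbol rather than a letter, so the subtraction form does not.
Console.WriteLine(nonAsciiLetter.IsMatch("\u00AC"));
// U+00E9 (é) is a letter outside Basic Latin, so both patterns match it.
Console.WriteLine(nonAsciiLetter.IsMatch("\u00E9"));
// A plain ASCII letter is excluded by the subtraction.
Console.WriteLine(nonAsciiLetter.IsMatch("e"));
```

Note that the subtraction form only targets letters, so it would leave ¬ alone but would also leave every other non-ASCII symbol in place.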

2 Answers


You just need to include the code point for the angle bracket in the set:

Try this:

Regex rgx = new Regex(@"[^\uxxxx\u0000-\u007F]");

Or this:

Regex rgx = new Regex(@"[^\uxxxx-\uxxxx\u0000-\u007F]");

(Where xxxx is the Unicode code point for the character you want to preserve.)

The reason for giving two options here is that I know you can specify multiple ranges within one negative character group, but I don't know if you can match individual characters with ranges.
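
As a rough illustration of the first option, here is a minimal sketch that assumes the character to preserve is U+00AC NOT SIGN (¬). The helper name `StripNonAsciiExceptNotSign` is made up for this example, and the sample text uses `\u` escapes so the result does not depend on how the source file itself is saved.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes every character that is neither ASCII nor U+00AC (¬).
static string StripNonAsciiExceptNotSign(string input)
{
    // \u00AC is listed as a single character next to the \u0000-\u007F range.
    return Regex.Replace(input, @"[^\u00AC\u0000-\u007F]", "");
}

// Sample line using all four delimiters, plus two accented letters to strip.
string sample = "a\u00ACb;caf\u00E9|\u00FC,d";
string cleaned = StripNonAsciiExceptNotSign(sample);

// The accented letters are removed; the NOT SIGN and ASCII delimiters remain.
Console.WriteLine(cleaned == "a\u00ACb;caf|,d");
```

A single character and a range sit side by side in the same negated class here, so the second (range-only) form should not be needed.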

Jon Skeet
  • Why couldn’t you match individual characters within ranges? I don’t know any regex dialect where that won’t work, Java’s prepass conversion of `\uXXXX` before lexical analysis notwithstanding. – tchrist Nov 15 '10 at 11:41
  • I've tried both of these, as well as using the '|'. I really don't understand why it's not working. It's a console application if that makes any difference? But I don't think it should, right angles just appear as "?" without the Regex applied, but disappear when it is, so it must be being stripped by the Regex. Confused! – New Start Nov 15 '10 at 11:44
  • @New, when things are appearing as "?" characters that shouldn’t, that always points to an encoding problem. Somewhere something is thinking your text is in a different encoding than it really is. Usually this means you have to declare which encoding you’re truly using because the default doesn’t apply to the text in question. – tchrist Nov 15 '10 at 12:15
  • @New Start: You've accepted the answer, which suggests it's now working... could you give any more information? – Jon Skeet Nov 15 '10 at 12:54
  • Yes, sorry! My test input file was saved with ANSI encoding instead of Unicode, so as soon as I resaved it with the correct encoding, characters were appearing normally and my Regex started working correctly. I'm still not sure why my original Regex worked at all in the first place, it seems only a part of the Regex seemed to work with ANSI encoding.. – New Start Nov 15 '10 at 14:03
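
One plausible explanation for the behaviour described in these comments, sketched under assumptions: if the file stored ¬ as the single byte 0xAC (as an ANSI code page would) and the program decoded it as UTF-8, that byte would turn into the replacement character U+FFFD, which the regex then strips like any other non-ASCII character. The byte values below are illustrative, and ISO-8859-1 stands in for the file's actual ANSI code page, which is not known here.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Bytes of "a¬b" as a single-byte ANSI editor would save them (0xAC is the NOT SIGN).
byte[] ansiBytes = { 0x61, 0xAC, 0x62 };

// Decoded as UTF-8, a lone 0xAC is invalid and becomes U+FFFD.
string wrong = Encoding.UTF8.GetString(ansiBytes);

// Decoded with a matching single-byte encoding, 0xAC becomes U+00AC again.
// ISO-8859-1 is an assumption standing in for the real code page.
string right = Encoding.GetEncoding("iso-8859-1").GetString(ansiBytes);

var rgx = new Regex(@"[^\u00AC\u0000-\u007F]");

// With the wrong decoding the delimiter is already gone before the regex runs,
// so the replacement character is simply removed.
Console.WriteLine(rgx.Replace(wrong, "") == "ab");
// With the right decoding the delimiter survives.
Console.WriteLine(rgx.Replace(right, "") == "a\u00ACb");
```

When reading from disk, the same idea applies by passing the encoding explicitly, for example with the `File.ReadAllText(path, encoding)` overload, instead of relying on the default.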

Jon's answer is absolutely correct. You may be using the wrong code for the character. Try the following for the similar looking characters:

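// Alternatives: use only one of these at a time.
// U+00AC NOT SIGN (¬):
Regex regex = new Regex(@"([^\u00ac\u0000-\u007F])");
// Look-alike at U+02FA:
Regex regex = new Regex(@"([^\u02fa\u0000-\u007F])");
// Look-alike at U+031A:
Regex regex = new Regex(@"([^\u031a\u0000-\u007F])");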

First one should work I think.

Yogesh