I'm trying to remove strings with unrecognized characters from a string collection. What is the best way to accomplish this?
For example? How do you define "unrecognized characters"? – Oded Oct 24 '12 at 20:10
Characters that are not recognized are displayed as a diamond shape with a "?" inside. I assume those characters are Unicode-encoded, and the ASCII encoding can't recognize them. – Rade Milovic Oct 24 '12 at 20:24
5 Answers
Since an array (assuming string[]) is not resized when items are removed, you will need to create a new one anyway. So basic LINQ filtering with ToArray() will give you a new array.
myArray = myArray.Where(s => !ContainsSpecialCharacters(s)).ToArray();
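The line above leaves ContainsSpecialCharacters undefined. A minimal sketch of one possible definition follows, assuming "special" means anything outside 7-bit ASCII, which also covers the replacement character U+FFFD (the diamond with a "?" inside). The class name StringFilters and the method RemoveSpecial are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;

public static class StringFilters
{
    // Hypothetical helper (not from the original answer): a string is "special"
    // if any char lies outside the 7-bit ASCII range. This also catches U+FFFD.
    public static bool ContainsSpecialCharacters(string s)
    {
        return s.Any(c => c > 127);
    }

    // Builds a new array that keeps only the strings without special characters.
    public static string[] RemoveSpecial(string[] input)
    {
        return input.Where(s => !ContainsSpecialCharacters(s)).ToArray();
    }
}
```

If your definition of "special" is different, only the predicate inside ContainsSpecialCharacters needs to change.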

OP has "array" in title, but "collection" in question body. Regardless, I like your answer. – Michael Sallmen Oct 24 '12 at 20:13
I'm using the List collection, which allows me to remove elements easily. – Rade Milovic Oct 24 '12 at 20:25
To remove strings that contain any characters you don't recognize (e.g. if you want to accept only lowercase letters, then "foo@bar" would be rejected):
- Create a regular expression which defines the set of "recognized" characters, followed by a + quantifier, and which starts with ^ and ends with $. For example, if your "recognized" characters are uppercase A through Z, it would be ^[A-Z]+$
- Reject strings that don't match
Note: This won't work for strings that contain newlines, but you can tweak it if you need to support that
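The whitelist steps above can be sketched in C# roughly as follows, assuming the "recognized" set is lowercase ASCII letters. The class and method names are hypothetical. The sketch uses \A and \z instead of ^ and $, because in .NET a plain $ also matches just before a trailing newline.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WhitelistFilter
{
    // Assumed whitelist: one or more lowercase ASCII letters and nothing else.
    // \A and \z anchor to the absolute start and end of the string.
    private static readonly Regex Recognized = new Regex(@"\A[a-z]+\z");

    // Keeps only the strings made up entirely of recognized characters.
    public static List<string> KeepRecognized(IEnumerable<string> strings)
    {
        return strings.Where(s => Recognized.IsMatch(s)).ToList();
    }
}
```

With a List you could equally call strings.RemoveAll(s => !Recognized.IsMatch(s)) to filter in place.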
To remove strings that consist entirely of characters you don't recognize (e.g. if you want to accept lowercase letters, then "foo@bar" would be accepted, because it contains at least one lowercase letter):
- Create a regular expression which defines the set of "recognized" characters, but with a ^ character inside the square brackets and a + quantifier after them, and which starts with ^ and ends with $. For example, if your "recognized" characters are uppercase A through Z, it would be ^[^A-Z]+$
- Reject strings that DO match
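The second variant can be sketched the same way, again assuming lowercase ASCII letters are the "recognized" set. The names below are hypothetical; only Regex and the LINQ calls are real APIs.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UnrecognizedOnlyFilter
{
    // Matches strings made up entirely of characters OUTSIDE the assumed
    // recognized set (lowercase ASCII letters).
    private static readonly Regex OnlyUnrecognized = new Regex(@"\A[^a-z]+\z");

    // Drops strings that contain no recognized character at all.
    // Empty strings are kept, because + requires at least one character to match.
    public static List<string> DropUnrecognizedOnly(IEnumerable<string> strings)
    {
        return strings.Where(s => !OnlyUnrecognized.IsMatch(s)).ToList();
    }
}
```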

Thanks, I already used some regex to remove new lines and some special characters. – Rade Milovic Oct 24 '12 at 20:27
@RadeMilovic I'm not sure what you mean by "remove unknown formatted chars only" - Do you mean to reject strings that are *entirely composed of* unrecognized chars? – Orion Edwards Oct 29 '12 at 00:25
I would look at LINQ's Where method, along with a regular expression containing the characters you're looking for. Roughly, with regex being a System.Text.RegularExpressions.Regex instance:
return myStringCollection.Where(s => !regex.IsMatch(s));

This does what you seem to want.
List<string> strings = new List<string>()
{
    "one",
    "two`",
    "thr^ee",
    "four"
};
List<char> invalid_chars = new List<char>()
{
    '`', '-', '^'
};
// Remove every string that contains at least one invalid character.
strings.RemoveAll(s => s.Any(c => invalid_chars.Contains(c)));
strings.ForEach(s => Console.WriteLine(s));
generates output:
one
four

No, I want to remove characters that are in another format and can't be recognized inside an ASCII-formatted string. – Rade Milovic Oct 24 '12 at 20:29
This question has some similar answers to what I think you are looking for. However, I think you want to include all letters, numbers, whitespace and punctuation, but exclude everything else. Is that accurate? If so, this should do it for you:
char[] arr = str.ToCharArray();
// Keep only letters, digits, whitespace and punctuation.
arr = Array.FindAll<char>(arr, (c => (char.IsLetterOrDigit(c) ||
    char.IsWhiteSpace(c) || char.IsPunctuation(c))));
str = new string(arr);
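Note that char.IsLetterOrDigit also accepts non-ASCII letters such as "é". If "recognized" is taken to mean 7-bit ASCII only, which is an assumption based on the asker's comments, a variant of the same character filter might look like this sketch. AsciiCleaner and StripNonAscii are hypothetical names.

```csharp
using System.Linq;

public static class AsciiCleaner
{
    // Returns a copy of str with every char above code point 127 removed.
    // This also strips the replacement character U+FFFD.
    public static string StripNonAscii(string str)
    {
        return new string(str.Where(c => c <= 127).ToArray());
    }
}
```

This cleans the characters inside each string instead of discarding the whole string, so it fits only if partially cleaned strings are acceptable.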

No, I want to remove characters that are in another format and can't be recognized inside an ASCII-formatted string. – Rade Milovic Oct 24 '12 at 20:53