Remove chars from string

Question

I have a string like

string Text = "012345678901234567890123456789";

and a List<int> with indexes

List<int> Indexes = new List<int>() { 2, 4, 7, 9, 15, 18, 23, 10, 1, 2, 15, 40 };

with following restrictions

there are duplicates within the list
the list is not sorted
there may be indexes > Text.length

what's the best way to remove characters from the text which are in the index list?

expected output:

035681234679012456789

Is there a more efficent way than

foreach (int index in Indexes
                        .OrderByDescending(x => x)
                        .Distinct()
                        .Where(x => x < Text.Length))
{
    Text = Text.Remove(index, 1);
}

Update: Here are the Benchmarks of the current answers (string with 100.000 characters and List<int> with length 10.000:

Gallant: 3.322 ticks
Tim Schmelter: 8.602.576 ticks
Sergei Zinovyev: 9.002 ticks
rbaghbanli: 7.137 ticks
Jirí Tesil Tesarík: 72.580 ticks

The question is why do you need the list at all? Can't you use a `HashSet` in the first place instead? Then you don't have duplicates and a lookup with `Contains` is much more efficient O(1). — Tim Schmelter, Apr 11 '16 at 15:25

Tim Schmelter · Answer 1 · 2016-04-11T13:52:08.263

11

Here's a more or less elegant LINQ way:

Text = new string(Text.Where((c, index) => !Indexes.Contains(index)).ToArray());

It uses the overload of Enumerable.Where that projects the index of the item in the sequence.

If you want the most efficient and not the most readable way and the text is really large you could use a HashSet<int> instead of the list which doesn't allow duplicates and a StringBuilder to create the new string:

var indexSet = new HashSet<int>(Indexes); // either create from the list(as shown here) or use it without your list
var textBuilder = new StringBuilder(Text.Length);

for(int i = 0; i < Text.Length; i++)
    if (!indexSet.Contains(i))
        textBuilder.Append(Text[i]);
Text = textBuilder.ToString();

Of course you could also use the HashSet<int> in the LINQ approach to make it more efficient.

edited Apr 11 '16 at 13:52

answered Apr 11 '16 at 13:38

Tim Schmelter

450,073
74
686
939

That will be O(n^2). There is a solution with O(n). – Riad Baghbanli Apr 11 '16 at 13:46
2

If you use the `HashSet` in the Linq approach that would still be O(n). – juharr Apr 11 '16 at 13:48
Building binary tree is not fast. Whatever you win in the loop, you are loosing at the construction of binary tree. – Riad Baghbanli Apr 11 '16 at 14:00
@Tim Schmelter, if you get HashSet as an input - that is an assumption, don't you think? OP explicitly indicated that Indexes is List, not HashSet. – Riad Baghbanli Apr 11 '16 at 14:07
1

@rbaghbanli: you are making more assumptions. It could be that OP isn't ware of the `HashSet` class. Also, if the list doesn't change (often) you can also hold the set as lookup instance whereas you keep the list. Then you gain performance with a little bit more memory costs. However, it's worth to mention it even if OP knows it and can't use it. In many other cases it will help to improve performance. – Tim Schmelter Apr 11 '16 at 14:09
"It could be that OP isn't ware of the HashSet class" - assumption numero uno. "if the list doesn't change (often)" - assumption numero dos. So, to repeat one dude from stackoverflow: you are making more assumptions. – Riad Baghbanli Apr 11 '16 at 14:45
@rbaghbanli: i don't make assumptions. I just show you that you're making one if you think that creating the set is expensive. I show different ways. As always, it depends. In this case it might not be micro-optimization because the string and the list is large but often people tend to obfuscate their code because they think that it's more efficient. In realitiy they won't notice a difference but the code is much harder to understand. – Tim Schmelter Apr 11 '16 at 14:51
As per question and input conditions listed in the question and the answer containing construction of binary tree, it is not an assumption but rather direct indication of performance deficiency as per peer code review. Here are the example of assumptions: "In this case it might not be micro-optimization", "often people tend to obfuscate their code", "code is much harder to understand". Stop making assumptions about people making assumptions. – Riad Baghbanli Apr 11 '16 at 15:04

score 9 · Answer 2 · edited Apr 11 '16 at 15:54

9

This will work faster:

string Text = "012345678901234567890123456789";
List<int> Indexes = new List<int>() { 2, 4, 7, 9, 15, 18, 23, 10, 1, 2, 15, 40 };

HashSet<int> hashSet = new HashSet<int>(Indexes);

StringBuilder sb = new StringBuilder(Text.Length);
for (int i = 0; i < Text.Length; ++i)
{
    if (!hashSet.Contains(i))
    {
        sb.Append(Text[i]);
    }
}

string str = sb.ToString();

edited Apr 11 '16 at 15:54

Tunaki

132,869
46
340
423

answered Apr 11 '16 at 13:41

Sergei Zinovyev

1,238
14
14

HashSet hashSet = new HashSet(Indexes) - is not fast at all. – Riad Baghbanli Apr 11 '16 at 14:01
@rbaghbanli Compared to .OrderByDescending(x=>x).Distinct() it's fast. – Taemyr Apr 11 '16 at 14:40
True, but there are faster solutions. – Riad Baghbanli Apr 11 '16 at 14:42
Did you include the creation of HashSet within the test and what sizes you benchmarked it over? – Riad Baghbanli Apr 11 '16 at 15:06
Building HashSet is O(n^2). Have a look, for example, here: http://stackoverflow.com/questions/15322386/complexity-of-inserting-n-numbers-into-a-binary-search-tree – Riad Baghbanli Apr 11 '16 at 16:24
A for loop? Why is everyone suggesting for loops? Doesn't C# have a `filter` method on lists? – cat Apr 11 '16 at 20:03

Riad Baghbanli · Answer 3 · 2016-04-11T14:20:01.590

7

Yes, see code below (it will iterate only once over each sequence):

var map = new short[Text.Length];
foreach (var i in Indexes)
{
    if (i < text.Count)
        map[i] = 1;
}
Text = new string(Text.Where((c, i) => map[i] == 0).ToArray());

edited Apr 11 '16 at 14:20

answered Apr 11 '16 at 13:40

Riad Baghbanli

3,105
1
12
20

By calling `RemoveAt` you alter the indexes of the array. So this only works if you start with indexes at the end of the array and remove the duplicates. That's why the OP was using `OrderByDescending` and `Distinct`. – juharr Apr 11 '16 at 13:42
Judging by the code from OP that is desired behaviour. – Riad Baghbanli Apr 11 '16 at 13:44
No. Imagine if the `Indexes` as just the number 0 repeated 29 times. The OP's code would only remove the first character while your code would remove all of them. It's sad how many people don't realize that this code is completely wrong. – juharr Apr 11 '16 at 13:45
I am not sure that building binary tree is fast process. The longer the Indexes the longer it will take to build HashSet. – Riad Baghbanli Apr 11 '16 at 14:03
You could use `.Cast()` instead of the `Select(c => (char?)c)`. – juharr Apr 11 '16 at 14:07
Agree, but that would be the same performance, as it still creates new NullableObject. It can be faster if, for example, OP agrees that input strings has no char == 0, then we can use 0 instead of null, and that will be way faster as no object need to be created. That gives me an idea. Hold on! – Riad Baghbanli Apr 11 '16 at 14:11
Also calling `ToString` on an `IEnumerable` does not result in a string made up of those chars. You need something like `new string(text.Where(c => c.HasValue).Select(c => c.Value).ToArray());` – juharr Apr 11 '16 at 14:14
Instead of an array of `short`, why not just use an array of `bool`? Also shouldn't it be `map[i] == 0` because you want to keep the characters which have an index that isn't in `Indexes`. – juharr Apr 11 '16 at 14:19
Using short for readability, as it is clear short[] is initialized with 0's, where as bool[] initialization with false's is not as obvious. In fact bool's will be allocated as shorts anyway. Or even ints. – Riad Baghbanli Apr 11 '16 at 14:22
1

Why is it more obvious that `short` defaults to 0 than `bool` defaults to `false`? IMO `bool` is better because you can name the array something like `isInList` and then do `Where((c,i) => !isInList[i])`. And you avoid using magic numbers. – juharr Apr 11 '16 at 14:25
Actually, at logical level bool default value as false is not obvious. It is just representation of false as 0 (all 0's) and true as -1 (all 1's) makes it so. So, in practice, I agree. I just wanted to make it as easy to read as possible. – Riad Baghbanli Apr 11 '16 at 14:30

Gallant · Accepted Answer · 2016-04-12T17:14:14.993

The following is making an assumption that your string contains a known set of characters. If you know for certain that, for example, Unicode character ￰ never appears in the string, you can use it as a placeholder to mark which characters to remove. This should be very fast in exchange for this limitation:

char temp = '\uFFF0';
StringBuilder sb = new StringBuilder(Text);
for (int i = 0; i < Indexes.Count; i++)
{
    if (Indexes[i] < sb.Length)
    {
        sb[Indexes[i]] = temp;
    }
}

Text = sb.Replace(temp.ToString(), null).ToString();

This appears to be anywhere between 3-4 times faster than building a HashSet like some other answers have suggested. http://ideone.com/mUILHg

If you cannot make the above assumption, you can build an array to contain this extra bit of data instead of using a unique character. This does two rounds of iteration (so it's a bit slower), but it is still O(n) efficiency (so it should typically be faster than putting Indexes into a hashmap before iterating).

bool[] exclude = new bool[Text.Length];
for (int i = 0; i < Indexes.Count; i++)
{
    if (Indexes[i] < exclude.Length)
    {
        exclude[Indexes[i]] = true;
    }
}
StringBuilder sb = new StringBuilder(Text.Length);
for (int i = 0; i < Text.Length; i++)
{
    if (!exclude[i])
    {
        sb.Append(Text[i]);
    }
}
Text = sb.ToString();

Quick benchmarks: http://ideone.com/3d2uPH

Why didn't you use the null char? http://stackoverflow.com/questions/3670505/why-is-there-no-char-empty-like-string-empty — Byyo, Apr 12 '16 at 11:41
surprisingly char array gave me the same speed: `var a = Text.ToCharArray(); foreach (int i in Indexes) if (i < Text.Length) a[i] = ' '; var result = new string(a).Replace(" ", "");` — Slai, Apr 12 '16 at 14:08
never mind .. char array is faster for me when the `Indexes` are more than the characters, but gets slower when the characters are 10 times or more than the `Indexes` — Slai, Apr 12 '16 at 14:28
@Byyo That works as well, but valid strings can contain the null character (which is just Unicode character `\u0000`), so the issue remains. — Gallant, Apr 12 '16 at 16:24

Jiří Tesil Tesařík · Answer 5 · 2016-04-13T13:35:30.677

0

A modified solution using byte (may be replaced by boolean) array instead of hash table. PROS: linear complexity, CONS: needs extra memory for flag array.

string Text = "012345678901234567890123456789";
List<int> Indexes = new List<int>() { 2, 4, 7, 9, 15, 18, 23, 10, 1, 2, 15, 40 };
byte[] contains = new byte[Text.Length];
Indexes.ForEach(p=> {if ( p<Text.Length) contains[p]=1;});
var output = string.Concat(Enumerable.Range(0, Text.Length).Where(p => contains[p] != 1).Select(p => Text[p]));

edited Apr 13 '16 at 13:35

answered Apr 12 '16 at 11:25

Jiří Tesil Tesařík

11
1
3

Providing extra information to clarify your answer is a good practice, instead of just posting code. – Phiter Apr 12 '16 at 11:29

Remove chars from string

5 Answers5