Delete all the characters in string S1 that occured in string S2

Question

Given two strings S1 and S2, S = S1 - S2 is defined to be the remaining string after taking all the characters in S2 from S1. How to calculate S1 - S2 for any given strings as fast as possible?

for example :

Input:

They are students.

aeiou

Output:

Thy r stdnts.

I've tried the hash map,sadlly the judger said that it is too slow,but can any solution be faster ?

Here is my code :

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
bool occur[300]={false};
int main()
{
    char str1[10002];
    gets(str1);
    char ch;
    while((ch=getchar())!='\n')
        occur[ch]=true;
    int i;
    for(i=0;i<strlen(str1);i++)
        if(occur[str1[i]])
            continue;
        else
            putchar(str1[i]);
    putchar('\n');
    return 0;
}

"I've tried the hash map" -- how? There are various ways in which a hash table can be used to solve this problem. Please describe your current algorithm. — Fred Foo, Mar 10 '13 at 10:42
Agreed with larsmans. The fastest solution to this problem necessarily involves a hash map or array indexing. — angelatlarge, Mar 10 '13 at 10:44
Also, while you didn't specify a language, I believe you have already wrote a program. Add your current implementation. — Zeta, Mar 10 '13 at 10:48
i dont think i am at a way but you can grab them out if you try just look on 1) http://en.wikipedia.org/wiki/Diff#Algorithm 2) http://www.codeproject.com/Articles/13326/An-O-ND-Difference-Algorithm-for-C 3)http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/ i think these may help , i seen them a bit earlier — internals-in, Mar 10 '13 at 10:56
I add my code above , can anyone of you come up a better solution? — yvxiang, Mar 10 '13 at 13:35

score 2 · Accepted Answer · edited May 23 '17 at 12:11

2

I think you should:

create HashSet S containing all chars in S2
use List to which you append chars as you iterate through S1 that are not in S
Build string out of the list ("".join(list..) in Python)

I don't think there is faster way.. You could split S1 in N parts and work on this parallel - this is the only optimization that I see ...

As for your code - don't use strlen in loop condition! see: strlen: how does it work? . Just iterate over all chars till you get '\0' char or compute strlen once and put on variable which you use in loop condition ...

edited May 23 '17 at 12:11

Community

1
1

answered Mar 10 '13 at 10:48

kkonrad

1,262
13
32

2

Actually, if your characters are 8-bits, you should probably just use your characters as indeces into an array: hash functions are not really necessary here. Depending on the architecture, a 255-byte array probably is very affordable. – angelatlarge Mar 10 '13 at 10:50
thanks very much, I changed my code and the Online Judger accepted it~ – yvxiang Mar 10 '13 at 14:35

score 1 · Answer 2 · answered Mar 10 '13 at 10:50

1

If you can limit the problem down to a small alphabet (e.g. english characters only), you can create a bool array of the size of your alphabet instead.

1 array lookup will be much faster then hashing or traversing a binary tree.

answered Mar 10 '13 at 10:50

Tomas Grosup

6,396
3
30
44

score 0 · Answer 3 · answered Mar 10 '13 at 12:09

Probably one of the fastest and easiest way of doing this would be to use regular expression replacement. See sample python code below.

If you can't use regular expressions, you will need one loop over each character of the input string. Since you are woring on every single character, any algorithm will be at least O(n). This means that the only way to speed up the implementation is to reduce the time spent on checking if a character needs to be copied to the output or not and the actual copy to the output. Since I do not know what language you are using, I will give a short implementation in python. This uses the python set class which allows for constant time checks if value is in the set or not. The sample code is given below.

import re

def remove1(string, chars):
    return re.sub("[%s]"%chars, "", string)

def remove2(string, chars):
    chars = set(chars)
    res = ""
    for c in string:
        if c not in chars:
            res += c

    return res

import unittest

class TestRemove(unittest.TestCase):
    def test_removeVowels1(self):
        self.assertEqual("Thy r stdnts.", remove1("They are students.","aeiou"))

    def test_removeVowels1(self):
        self.assertEqual("Thy r stdnts.", remove2("They are students.","aeiou"))

if __name__=="__main__":
    unittest.main()

NOTE: If you are using a language like C++ and you know that the input is limited to 8-bit values, the fastest way is to use direct adressing; ie use the character value as an array index.

I show my code here,but the Online Judger still tells me "Time Limitted" — yvxiang, Mar 10 '13 at 13:39

alzaimar · Answer 4 · 2013-03-10T17:15:51.957

0

Technically, the Hashmap solution is O(n)+O(m), with n being the length of the sentence and m the amount of forbidden chars.

To my point of view, this is as fast as you can get as you have to run through the sentence deciding whether to keep or discard that character. Also, you have to run through all forbidden chars at least once in order to know them.

But, I can imagine that there are more efficient solutions as the ones presented, i.e. less overhead. But to be honest, I can't think of one.

Update (this is the most simple as it can get, but it's O(n*m). However, it might be faster than the other approaches for short strings):

foreach (c in sentence) 
  if (forbiddenChars.IndexOf(c) == -1) 
    Console.Write(c);

edited Mar 10 '13 at 17:15

answered Mar 10 '13 at 12:12

alzaimar

4,572
1
16
30

I add my code upon,I think it is the fatest solution ,but the Online Judger tells me "Time Limitted".. – yvxiang Mar 10 '13 at 13:34
Your 'online judger' definitely mixes up 'simplicity' and 'performance'. I will add a very simple version to my answer. – alzaimar Mar 10 '13 at 17:13

Delete all the characters in string S1 that occured in string S2

4 Answers4