139

I had an interview with a hedge fund company in New York a few months ago and, unfortunately, I did not get the internship offer as a data/software engineer. (They also asked for the solution to be in Python.)

I pretty much screwed up on the first interview problem...

Question: Given a string of a million digits (pi, for example), write a function/program that returns all repeating 3-digit numbers, with their repetition counts, where the count is greater than 1.

For example: if the string was: 123412345123456 then the function/program would return:

123 - 3 times
234 - 3 times
345 - 2 times

They did not give me the solution after I failed the interview, but they did tell me that the time complexity for the solution was constant of 1000 since all the possible outcomes are between:

000 --> 999

Now that I'm thinking about it, I don't think it's possible to come up with a constant time algorithm. Is it?

Vikas Yadav
its.david
  • 68
    If they think the solution is a constant of 1000, then that makes me think that they would have built all the three-digit numbers, and then regex searched for them. It's very common for people to think that operations they didn't actually write/see are "free". I'm pretty sure this would be linear to the length of the string. – mypetlion Nov 30 '17 at 19:41
  • @mypetlion thanks for the response, what do you mean by regex searched them? Is that some sort of concept like Bitwise operations? – its.david Nov 30 '17 at 19:44
  • 1
    No, regex (or Regular Expressions) are a string searching/matching tool. So what I mean is that they would have got all the three digit numbers as strings and then used a tool to find all the occurrences of those number strings within the longer string. You can read about the Python implementation of regex here: https://docs.python.org/3/library/re.html and the general concept here: http://www.aivosto.com/vbtips/regex.html – mypetlion Nov 30 '17 at 19:51
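For concreteness, here is a minimal sketch of the approach guessed at above (editorial illustration only; the names and structure are not from the comment). It builds each of the 1000 candidate strings and does one regex search per candidate, using a lookahead so overlapping occurrences are counted. That is roughly 1000 linear passes, i.e. O(1000·n), not constant time:

import re

def regex_counts(text):
    # One linear scan per candidate; the lookahead counts overlapping matches.
    results = {}
    for n in range(1000):
        candidate = '%03d' % n
        hits = len(re.findall('(?=%s)' % candidate, text))
        if hits > 1:
            results[candidate] = hits
    return results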
  • 55
    Nitpickingly, if the input size is a constant, every algorithm is constant time ;-) – Paŭlo Ebermann Nov 30 '17 at 21:57
  • 34
    a constant of 1000 _what_? (additions? elephants?) – ilkkachu Nov 30 '17 at 22:28
  • 1
    @ilkkachu - exactly, what's being counted is important. On the one hand it is possible the interviewer meant "linear in the length of the string". On the other hand, since the length of the string was _fixed_ at 1M, the interviewer might have meant "O(1)" in terms of time (or space) on the accumulating data structure. E.g., since there are only 1000 three digit numbers use an array to hold the counts, not a linked list. BTW, OP, in a case like this it is always valid to ask the interviewer _what is being counted_ as an operation. – davidbak Dec 01 '17 at 01:46
  • 31
    Well, if the string length is constant (1M) and the substring/number length is constant (3), then technically every solution is constant time… – Kevin Dec 01 '17 at 01:55
  • you can get somewhat close to constant time if you have enough parallel processing resources and can execute all the comparisons in parallel. obviously there is some contention on the atomic increments, but for random data this will approach a small logarithmic rate. also, for some of these kinds of algorithms, doing them in hardware makes some things possible that aren't in linear SW for the same reason. – Corley Brigman Dec 01 '17 at 08:07
  • 1
    Reading this, three things stand out: 1) This was the very first question of the interview, 2) they specifically mention Pi (an irrational number) and 3) the ambiguous "constant of 1000". As the answers state it's not possible, making me wonder whether this was a trick question and with a sufficiently long test string, all combinations could be expected to repeat an equal number of times. Was this an interactive question between you an interviewer? Just a different perspective to the `O(n)` answers if you got the impression it had to be constant time. – roganjosh Dec 01 '17 at 10:18
  • 8
    `They did not give me the solution after I failed the interview, but they did tell me that the time complexity for the solution was constant of 1000 since all the possible outcomes are between: 000 --> 999` This was likely the actual test. To see if you could prove to them why this is not possible and to show them the correct minimum time complexity. – James Dec 01 '17 at 15:31
  • 2
    @TechMedicNYC: My thoughts too. Assuming the interviewer wasn't stupid, he probably wanted to see how long it would take OP to answer "Wait a minute! That's not possible!". – Eric Duminil Dec 01 '17 at 17:45
  • 1
@roganjosh: You have a point. All 1000 substrings are present in the first 1E6 decimals of pi; the least frequent appears 898 times and the most frequent one appears 1092 times. Simply returning the 1000 substrings with a frequency of approximately 1000 each wouldn't be too wrong. – Eric Duminil Dec 01 '17 at 17:58
  • 1
    All this talk of O(n) vs O(1) reminds me of when I joked to a friend that Java's `ArrayList#contains` is technically O(1), since Java arrays are capped (by spec, not just implementation) at 2^31-1 elements. :-) – yshavit Dec 01 '17 at 20:16
  • 2
    Obligatory honest description of a [Job Interview](https://www.youtube.com/watch?v=YHy06FMsezI). – code_dredd Dec 01 '17 at 22:19
  • It could be done with O(1) constant time complexity, but you'd need huge space complexity: a **lookup table** with 10^6 inputs and 10^3 6-bit outputs, one for each 3-digit string combination xyz. So 10^9 bytes of storage, i.e. an SSD with 1Gb useable space. However to sum the intermediate results you'd also need a huge **array of 1000 x depth-20 trees of adders to sum the partial results** (can't use [Wallace trees](https://en.wikipedia.org/wiki/Wallace_tree) because they're O(N) time-complexity). Theoretically possible but in reality silly for a very regular map-reduce type problem like this. – smci Dec 02 '17 at 00:28
  • 1
    @smci: I don't get it. You'd still need to iterate at least once on each decimal of pi, if only to find it in the lookup table. Is your plan really to save the result for every number with 1E6 decimals? Your input size would be `10**(10**6)` then. – Eric Duminil Dec 02 '17 at 10:12
  • @EricDuminil exactly my thinking, I've been thrown curve-balls before where the expectation is completely unknown but not in this field. They stress you out. In this case, I think it would be safer to assume that the answer is along the lines of "Did you want it O(1) ? The closest you can get is an approximation that probabilistically gets better with longer inputs. However, here's a correct O(n) answer". Not safe to assume the asker is simply dumb as some answers suggest. – roganjosh Dec 02 '17 at 11:39
  • 1
    @ezzzCash Do you mean something like [Demo](http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=ef5b6844e1256782571ef94514313dda)? – Lukasz Szozda Dec 02 '17 at 16:11
@lad2025 hey, thanks for taking the time to make a demo using SQL. I wasn't particularly looking for a demo, but if it were one, I wanted it in Python (like I said above, and the answer has already been given). However, thanks for doing it in SQL. Always nice to see how it could also be implemented in SQL – its.david Dec 02 '17 at 16:55
  • Giving them the benefit of the doubt, I'd guess they mean `O(1000 + n)`: Under any implementation reading the string is (at least) `O(n)` and if you play your cards right, the rest of it is `O(1000)` (iterating the buckets after you've finished). So if you assume every solution must read the input, and consider only how you store, aggregate and retrieve the answers, their statement makes sense. Otherwise, they are loonies, but that seems less likely than they are just not conveying their ideas effectively. – TemporalWolf Dec 02 '17 at 22:18
Yeah, I agree with your reasoning. Considering it was the first round and it was only 30 min, I was given 15 min to solve it, and I guess they were trying to rush it and explain it to me. I hold the most fault, but they weren't perfect either. – its.david Dec 02 '17 at 23:22
  • 1
Honestly, if they think it is a constant 1000, you should be happy they didn't hire you. You need somebody who actually knows what they are doing at your first job to give you proper mentorship... Speaking from experience... – user3682091 Dec 03 '17 at 23:24
  • 1
Yeah, I wasn't thinking that at first, but now, with all the helpful advice and answers, maybe the rejection was a good thing – its.david Dec 03 '17 at 23:49
In C I would do it as sums of absolute values of circular convolutions [1,-1,0] [0,1,-1] on a char ring buffer. No idea how to do that the best way in Python though. – mathreadler Jan 28 '18 at 12:07

13 Answers

169

You got off lightly, you probably don't want to be working for a hedge fund where the quants don't understand basic algorithms :-)

There is no way to process an arbitrarily-sized data structure in O(1) if, as in this case, you need to visit every element at least once. The best you can hope for is O(n) in this case, where n is the length of the string.

Although, as an aside, a nominal O(n) algorithm will be O(1) for a fixed input size so, technically, they may have been correct here. However, that's not usually how people use complexity analysis.

It appears to me you could have impressed them in a number of ways.

First, by informing them that it's not possible to do it in O(1), unless you use the "suspect" reasoning given above.

Second, by showing your elite skills by providing Pythonic code such as:

inpStr = '123412345123456'

# O(1) array creation.
freq = [0] * 1000

# O(n) string processing.
for val in [int(inpStr[pos:pos+3]) for pos in range(len(inpStr) - 2)]:
    freq[val] += 1

# O(1) output of relevant array values.
print ([(num, freq[num]) for num in range(1000) if freq[num] > 1])

This outputs:

[(123, 3), (234, 3), (345, 2)]

though you could, of course, modify the output format to anything you desire.

And, finally, by telling them there's almost certainly no problem with an O(n) solution, since the code above delivers results for a one-million-digit string in well under half a second. It seems to scale quite linearly as well, since a 10,000,000-character string takes 3.5 seconds and a 100,000,000-character one takes 36 seconds.

And, if they need better than that, there are ways to parallelise this sort of stuff that can greatly speed it up.

Not within a single Python interpreter of course, due to the GIL, but you could split the string into something like the following (the overlap, indicated by vv, is required to allow proper processing of the boundary areas):

    vv
123412  vv
    123451
        5123456

You can farm these out to separate workers and combine the results afterwards.

The splitting of input and combining of output are likely to swamp any saving with small strings (and possibly even million-digit strings) but, for much larger data sets, it may well make a difference. My usual mantra of "measure, don't guess" applies here, of course.
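As an illustration only, here is a minimal sketch of that split-and-combine idea using multiprocessing (the helper names, chunk sizing, and worker count are my assumptions, not part of the original answer):

from multiprocessing import Pool

def count_triples(chunk):
    # Histogram the 3-digit substrings within one chunk.
    freq = [0] * 1000
    for pos in range(len(chunk) - 2):
        freq[int(chunk[pos:pos+3])] += 1
    return freq

def chunk_with_overlap(s, n_chunks):
    # Each piece overlaps the next by 2 digits, so boundary triples
    # are counted exactly once and never missed.
    size = max(1, len(s) // n_chunks)
    return [s[i:i + size + 2] for i in range(0, len(s), size)]

def parallel_count(s, n_workers=4):
    with Pool(n_workers) as pool:
        partials = pool.map(count_triples, chunk_with_overlap(s, n_workers))
    return [sum(col) for col in zip(*partials)]

Whether this beats the single-process loop for a given input size is exactly the sort of thing to measure rather than guess.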


This mantra also applies to other possibilities, such as bypassing Python altogether and using a different language which may be faster.

For example, the following C code, running on the same hardware as the earlier Python code, handles a hundred million digits in 0.6 seconds, roughly the same amount of time as the Python code processed one million. In other words, much faster:

#include <stdio.h>
#include <string.h>

int main(void) {
    static char inpStr[100000000+1];
    static int freq[1000];

    // Set up test data.

    memset(inpStr, '1', sizeof(inpStr));
    inpStr[sizeof(inpStr)-1] = '\0';

    // Need at least three digits to do anything useful.

    if (strlen(inpStr) <= 2) return 0;

    // Get initial feed from first two digits, process others.

    int val = (inpStr[0] - '0') * 10 + inpStr[1] - '0';
    char *inpPtr = &(inpStr[2]);
    while (*inpPtr != '\0') {
        // Remove hundreds, add next digit as units, adjust table.

        val = (val % 100) * 10 + *inpPtr++ - '0';
        freq[val]++;
    }

    // Output (relevant part of) table.

    for (int i = 0; i < 1000; ++i)
        if (freq[i] > 1)
            printf("%3d -> %d\n", i, freq[i]);

    return 0;
}
paxdiablo
  • 19
    This "fixed input size" really sounds like a bad joke either the interviewer or the interviewee didn't get. Every algorithm becomes `O(1)` is `n` is fixed or bounded. – Eric Duminil Dec 01 '17 at 13:11
  • 5
    If they need better than that, maybe they shouldn't be using Python, at least for the specific algorithm. – Sebastian Redl Dec 01 '17 at 13:28
@paxdiablo, thanks so much for the detailed response. Not only did you tackle the problem and answer my question, but you also described the interview process and how to do well next time. I'll keep your advice in mind for the next interview. Also, why is it `len(inpStr) - 2` (why minus 2)? – its.david Dec 01 '17 at 21:10
  • 3
    @ezzzCash Because there may be overlap at the points where the string is being "broken up" when trying a parallel approach. Since you're looking for 3-digit groups, -2 allows the check on both parallel groupings to *not* miss a potentially valid match. – code_dredd Dec 01 '17 at 22:32
@Ray, I still don't understand the idea. I think I'm lacking in parallel programming knowledge. Is there a resource I can look into to make me understand why it's -2? – its.david Dec 02 '17 at 02:11
  • 5
    @ezzzCash It's not a lack of parallel programming knowledge. Consider a string of length `N`. If you break it into two parts at position `N/2`, you still need to account for the fact that you could miss a valid 3-digit match at the "border", at the end of `string1` and the beginning of `string2`. Thus, you need to check matches between `string1[N/2-2]` and `string2[2]` (using a zero-based index), etc. That's the idea. – code_dredd Dec 02 '17 at 03:28
  • @ray Okay I see it. Gosh that's so simple, how did I not understand that aha. Thanks for taking the time to explain it to me :) – its.david Dec 02 '17 at 03:49
  • 1
    With longer digit-sequences, there'd be something to gain from optimizing the conversion to integer with a sliding window that lets you drop the highest digit and add a new digit. (Python overhead would probably kill this, so it would only apply to C or other low-level implementations). `val -= 100 * (d[i]-'0');` to drop the leading digit. `val = 10*val + d[i+2]-'0'` to accumulate a new least-significant digit (normal string->integer parsing). `val % 100` is possibly non-horrible, but only if `100` is a compile-time constant so it doesn't use a real HW divide. – Peter Cordes Dec 02 '17 at 10:09
  • Or maybe optimize the string->int with SIMD: https://stackoverflow.com/questions/35127060/how-to-implement-atoi-using-simd/. Or save temporary values during string->int conversion so you don't have to undo the most significant digit in the first place. (Of course, with longer substrings, the histogram size grows exponentially, so you'd want to start maybe making multiple passes over the array using only a single prefix to keep some memory locality in your histogram. But then there must be ways to optimize across multiple passes...) – Peter Cordes Dec 02 '17 at 10:13
  • @PeterCordes - in one of my comments following my answer, I mentioned the option of getting 4 indexes at a time, num = int(str[i:i+6]) ... i += 4, (grab 6 digits at a time, advance by 4 digits to handle overlap) then using division and modulo to pick out the set of four 3 digit values. As you mentioned, this assumes that divide by constant is optimized to multiply + shift. Considering python can be 100+ times slower than C / C++, I'm not sure if division overhead in Python is that significant. – rcgldr Dec 02 '17 at 11:19
  • 1
    @rcgldr: That seems bad vs. saving temporary values (e.g. using `10*d4 + d5` as part of computing both `100*d3 + 10*d4 + d5` and `(10*d4 + d5) * 10 + d6`). Maybe in Python where the cost of gluing together operations is much higher than the cost of the operation itself. It's easy to say "if you care about performance don't use Python this way in the first place", but sometimes just finding an efficient way in Python is enough to not bother making a C or asm implementation. But note that actual HW division has ~30x (32-bit) or ~90x (64-bit int) the throughput cost of integer mul on x86. – Peter Cordes Dec 02 '17 at 11:30
  • @PeterCordes - I had the impression that the main bottleneck is converting a string to an integer: int(str[start:end]), so I convert 6 digits at a time (although only advancing 4 at a time). My comment about the division overhead being masked by Python overhead is assuming that division (or modulo) by a constant was implemented as multiply + shift (+ subtract for modulo). I looked at a couple of web site benchmarks and python was about 100+ times slower than C. Perhaps this was an interpreted versus compiled version of Python. Java uses a virtual machine, but it's no where near as slow. – rcgldr Dec 02 '17 at 13:44
  • @rcgldr: Right, modern JVMs JIT-compile and do a reasonable job for hot functions. Most Python interpreters only interpret. IDK what the state of the art is, I don't use Python. Int->string overhead presumably depends on the length of the string, unless Python overhead is dwarfs it. For 3-digit numbers, it should be very cheap. – Peter Cordes Dec 02 '17 at 13:49
How does `freq = [inpStr.count(str(n)) for n in range(1000) if inpStr.count(str(n)) > 1]` compare? Or using a Counter object? – Acccumulation Dec 04 '17 at 03:07
  • 1
    @Acccumulation, surely that's something you could test yourself. I'm assuming you didn't come up with that list comprehension as a Python beginner so I'd suspect you have an environment available to you :-) – paxdiablo Dec 04 '17 at 03:29
79

Constant time isn't possible. All 1 million digits need to be looked at at least once, so that is a time complexity of O(n), where n = 1 million in this case.

For a simple O(n) solution, create an array of size 1000 that represents the number of occurrences of each possible 3 digit number. Advance 1 digit at a time, first index == 0, last index == 999997, and increment array[3 digit number] to create a histogram (count of occurrences for each possible 3 digit number). Then output the content of the array with counts > 1.
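For concreteness, a minimal sketch of that description in Python (the function and variable names are mine, not from the answer):

def repeated_triples(digits):
    counts = [0] * 1000          # one bucket per possible 3-digit number
    for i in range(len(digits) - 2):
        counts[int(digits[i:i+3])] += 1
    return [(n, c) for n, c in enumerate(counts) if c > 1]

print(repeated_triples("123412345123456"))
# [(123, 3), (234, 3), (345, 2)]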

rcgldr
Okay, thanks for the response. Would a dictionary work? The key would be the 3 repeating digits and the value would be the number of times it repeats. As you go through the string, you would add new key-value pairs to your dictionary (although that dictionary would be quite long) – its.david Nov 30 '17 at 20:07
  • 26
    @ezzzCash - yes a dictionary would work, but it's not needed. All possible "keys" are known in advance, limited to the range 0 to 999. The difference in overhead would be the time it takes to do a key based access using 3 character strings as keys, versus the time it takes to convert a 3 digit string to an index and then using the index to access the array. – rcgldr Nov 30 '17 at 20:14
  • 4
    If you want numeric tricks, you could also decide to go BCD and store the three digits in 12 bits. And decode ASCII digits by masking the low 4 bits. But that `x-'0'` pattern is not valid in Python, it's a C-ism (where characters are integers). – Yann Vernier Nov 30 '17 at 20:32
  • A dictionary would be much slower than a simple array. – Loren Pechtel Nov 30 '17 at 23:54
  • 5
    @LorenPechtel: Dictionary lookups in Python are really fast. Granted, array access is even faster, so if we were dealing with integers from the beginning, you'd be right. However, in this case, we have 3-length strings, which we first have to convert to integers if we want to use them with arrays. It turns out that contrary to what one might first expect, the dictionary lookup is actually faster than the integer conversion + array access. The array solution is in fact 50% slower in this case. – Aleksi Torhamo Dec 01 '17 at 10:44
  • @AleksiTorhamo That's really weird. Converting a string to a number should not be more expensive than hashing it, which is what the dictionary lookup has to do. – Sebastian Redl Dec 01 '17 at 13:24
@AleksiTorhamo - Possible optimizations: initialize arr[] to 1000 zeroes, which would eliminate the need to check for empty elements. Pick up 4 indices at a time: | num = int(str[i:i+6]) | arr[num // 1000] += 1 | arr[num % 1000] += 1 | num = (num % 100000) // 10 | arr[num // 10] += 1 | arr[num % 1000] += 1 | i += 4 | ... – rcgldr Dec 01 '17 at 14:21
  • 2
I guess one could argue that if the input number _always_ has _exactly_ 1 million digits, then that algorithm _is_ O(1), with a constant factor of 1 million. – tobias_k Dec 01 '17 at 14:57
  • 2
    @AleksiTorhamo - If the goal is to compare relative speeds of implementations for an algorithm, I would prefer a traditional language like C or C++, since Python is significantly slower and seems to have overheads unique to Python compared to other languages. – rcgldr Dec 01 '17 at 16:16
  • For purposes of pedantry - the problem is of constant complexity - the input is fixed length of 1,000,000 numbers. Therefore its time complexity is O(1) (and O(1000) and O(10^10^10^10) - these are all the same complexity). In an interview you may be able to score some points by being clear on this but also being able to articulate how your algorithm would scale with a variable length input. – deadcode Dec 01 '17 at 19:18
@rcgldr - Technically correct is the best kind of correct ;). I think since the interviewer introduced the notion of time complexity, this is an opportunity to show your knowledge of time complexity, and there is a big risk to saying: Here is my O(N) algorithm and have them correct you / mark you down for not understanding / arguing with them that the problem can be solved in constant time. I think the fullest, most correct interview response is to acknowledge this problem is solvable in constant time since the input is limited, then show your good solution that scales with variable input. – deadcode Dec 01 '17 at 21:49
  • @tobias_k - It doesn't matter if the input always has exactly 1 million digits, the term [constant time](https://en.wikipedia.org/wiki/Time_complexity#Constant_time) only applies if the algorithm's time complexity is independent of the size of the input. – rcgldr Dec 02 '17 at 01:32
  • @deadcode - It doesn't matter if the input always has exactly 1 million digits, the term [constant time](https://en.wikipedia.org/wiki/Time_complexity#Constant_time) only applies if the algorithm's time complexity is independent of the size of the input. – rcgldr Dec 02 '17 at 01:32
  • @rcgldr - You are missing an important distinction: if an algorithm is specified to only be able to take a constant size input (as is the case here) - its runtime is constant. What is the runtime of enumerating all legal configurations of a chessboard? Constant (because there are only a constant number of such configurations). What is the runtime of printing the position and velocity of every particle in the universe? Constant (because there are a finite number of particles). What is the runtime of printing the length of an arbitrary string? O(N) - because the string's length is not bounded. – deadcode Dec 02 '17 at 05:57
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/160327/discussion-between-deadcode-and-rcgldr). – deadcode Dec 02 '17 at 07:20
  • You can think of it that way: Instead of having a loop over all the digits in the input string, you could replace the code with literally one million lines, each checking a different digit in the fixed-length input. But as @deadcode said, that's really a bit pedantic and besides any practicality. – tobias_k Dec 02 '17 at 09:54
  • As mentioned in paxdiablo's answer, I also consider this as "suspect reasoning" of what I consider mis-usage of the term "constant time". – rcgldr Dec 02 '17 at 11:13
  • @rcgldr If the problem space has only one element, then technically all algorithms are independent of the input size. – Acccumulation Dec 04 '17 at 03:11
@AleksiTorhamo You should avoid saying "50% slower". It's ambiguous between "50% more time" and "100% more time". – Acccumulation Dec 04 '17 at 03:12
15

A million is small for the answer I give below. Assuming only that you have to be able to run the solution in the interview, without a pause, the following works in less than two seconds and gives the required result:

from collections import Counter

def triple_counter(s):
    c = Counter(s[n-3: n] for n in range(3, len(s) + 1))  # + 1 so the final triple is included
    for tri, n in c.most_common():
        if n > 1:
            print('%s - %i times.' % (tri, n))
        else:
            break

if __name__ == '__main__':
    import random

    s = ''.join(random.choice('0123456789') for _ in range(1_000_000))
    triple_counter(s)

Hopefully the interviewer would be looking for use of the standard library's collections.Counter class.

Parallel execution version

I wrote a blog post on this with more explanation.

Paddy3118
It works fine and seems to be the fastest non-numpy solution. – Eric Duminil Dec 01 '17 at 13:53
  • 3
@EricDuminil, I don't think you should worry about having the fastest timings here, when most solutions given won't delay you much. Far better to show that you have a good grasp of the Python standard library and can write maintainable code in an interview situation, I would think. (Unless the interviewer *stressed* the time criticality, whereupon you should ask for actual timings before assessing what comes next). – Paddy3118 Dec 01 '17 at 14:19
  • 1
    We agree 100%. Though I'm not sure any answer is relevant at all if the interviewer really thinks it's possible to do in `O(1)`. – Eric Duminil Dec 01 '17 at 14:34
  • 1
    If the interviewer stressed it was time critical, then, **after** profiling to confirm this is the limit, it may be time to write a C module to address this bottleneck. I have a script that saw an 84x improvement over python code after we switched to using a c module. – TemporalWolf Dec 01 '17 at 23:13
  • Hi @TemporalWolf, I read what you said then thought that another, faster, and scaleable solution might be to change it to a parallel algorithm so it could be run on many processes on a compute farm/cloud. You have to split the string into n sections; overlapping the last 3 chars of each section with its next section. Each section can then be scanned for triples independently, the triples summed, and the three char triple at the end of all but the last section subtracted out as it would have been double counted. I have the code, and will probably turn it into a blog post... – Paddy3118 Dec 03 '17 at 21:22
  • Parallelizing, especially across multiple machines, incurs a significant overhead. For a string of 1M characters, it is likely overkill. The more we optimize for speed, the greater the maintenance burden of the code. So we need to justify that burden beforehand (via profiling). – TemporalWolf Dec 03 '17 at 23:28
  • Of course we need to measure before any optimisation that takes you away from pythonic code. Instead of just wondering about a multi core solution I decided to write the code so I get a first hand idea of the effort. Multi cores are everywhere! – Paddy3118 Dec 04 '17 at 07:02
13

The simple O(n) solution would be to count each 3-digit number:

for nr in range(1000):
    cnt = text.count('%03d' % nr)
    if cnt > 1:
        print('%03d is found %d times' % (nr, cnt))

This would search through all 1 million digits 1000 times.

Traversing the digits only once:

counts = [0] * 1000
for idx in range(len(text)-2):
    counts[int(text[idx:idx+3])] += 1

for nr, cnt in enumerate(counts):
    if cnt > 1:
        print('%03d is found %d times' % (nr, cnt))

Timing shows that iterating only once over the index is twice as fast as using count.

Peter Mortensen
Daniel
  • 37
    Is there a black friday discount on `text.count()`? – Eric Duminil Nov 30 '17 at 19:57
  • 3
    @EricDuminil You do have a good point but, since `text.count` is done in a high-speed compiled language (e.g. C) as opposed to slow python-level interpreted looping, yes there is a discount. – John1024 Nov 30 '17 at 20:02
It's very inefficient to count each number separately, but it's a constant factor (1000 passes), thus still O(n). – Loren Pechtel Nov 30 '17 at 23:55
  • 11
    The option you proposed that uses `count` is incorrect, since it won't count overlapping patterns. Note that `'111'.count('11') == 1` when we would expect it to be `2`. – Cireo Dec 01 '17 at 00:50
  • 2
    Also, your "simple `O(n)` solution" is actually `O(10**d * n)` with `d` the number of searched digits and `n` the total length of the string. The second one is `O(n)` time and `O(10**d + n)` space. – Eric Duminil Dec 01 '17 at 13:07
10

Here is a NumPy implementation of the "consensus" O(n) algorithm: walk through all triplets and bin as you go. The binning is done by, upon encountering say "385", adding one to bins[3, 8, 5], which is an O(1) operation. The bins are arranged in a 10x10x10 cube. As the binning is fully vectorized, there is no loop in the code.

def setup_data(n):
    import random
    digits = "0123456789"
    return dict(text = ''.join(random.choice(digits) for i in range(n)))

def f_np(text):
    # Get the data into NumPy
    import numpy as np
    a = np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8) - ord('0')
    # Rolling triplets
    a3 = np.lib.stride_tricks.as_strided(a, (3, a.size-2), 2*a.strides)

    bins = np.zeros((10, 10, 10), dtype=int)
    # Next line performs O(n) binning
    np.add.at(bins, tuple(a3), 1)
    # Filtering is left as an exercise
    return bins.ravel()

def f_py(text):
    counts = [0] * 1000
    for idx in range(len(text)-2):
        counts[int(text[idx:idx+3])] += 1
    return counts

import numpy as np
import types
from timeit import timeit
for n in (10, 1000, 1000000):
    data = setup_data(n)
    ref = f_np(**data)
    print(f'n = {n}')
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert np.all(ref == func(**data))
            print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                'f(**data)', globals={'f':func, 'data':data}, number=10)*100))
        except:
            print("{:16s} apparently crashed".format(name[2:]))

Unsurprisingly, NumPy is a bit faster than @Daniel's pure Python solution on large data sets. Sample output:

# n = 10
# np                    0.03481400 ms
# py                    0.00669330 ms
# n = 1000
# np                    0.11215360 ms
# py                    0.34836530 ms
# n = 1000000
# np                   82.46765980 ms
# py                  360.51235450 ms
Peter Mortensen
Paul Panzer
  • Probably significantly faster to flatten the digit string instead of having nested bins, unless NumPy ends up implementing it as a 3D matrix with efficient indexing. Which version of @Daniel's did you time against; the one that runs a string search for each integer, or the one with a histogram? – Peter Cordes Dec 02 '17 at 09:50
  • 2
    @PeterCordes I doubt it. `ndarray`s, the core numpy type, are all about efficient storage, manipulation and indexing of multidimensional arrays of numbers. Sometimes you can shave off a few % by flattening, but in this case doing the 100 x[0] + 10 x[1] + x[2] by hand won't gain you much. I used the one @Daniel said was faster, you can check the benchmark code yourself. – Paul Panzer Dec 02 '17 at 13:35
  • I don't really know NumPy (or Python in general; mostly I do C and assembly performance tuning for x86), but I think you have a single 3D array, right? I was thinking from your English text (which apparently I didn't even read carefully) that you had actual nested Python objects and were indexing them separately. But that's not the case, so nvm my first comment. – Peter Cordes Dec 02 '17 at 13:54
I think the pure Python version you used is pretty much the same histogram implementation that the even higher-voted answers used, but I don't know if different ways of writing it in Python affect the speed much. – Peter Cordes Dec 02 '17 at 13:57
3

I would solve the problem as follows:

def find_numbers(str_num):
    final_dict = {}
    buffer = {}
    for idx in range(len(str_num) - 2):  # - 2, not - 3, so the final triple is included
        num = int(str_num[idx:idx + 3])
        if num not in buffer:
            buffer[num] = 0
        buffer[num] += 1
        if buffer[num] > 1:
            final_dict[num] = buffer[num]
    return final_dict

Applied to your example string, this yields:

>>> find_numbers("123412345123456")
{345: 2, 234: 3, 123: 3}

This solution runs in O(n) for n being the length of the provided string, and is, I guess, the best you can get.

paho
  • 1
    You could simply use a `Counter`. You don't need a `final_dict`, and you don't have to update it at each iteration. – Eric Duminil Nov 30 '17 at 20:28
2

As per my understanding, you cannot have the solution in constant time. It will take at least one pass over the million-digit number (assuming it's a string). You can have a 3-digit rolling iteration over the digits of the million-length number, and increase the value of a hash key by 1 if it already exists, or create a new hash key (initialized to value 1) if it doesn't exist already in the dictionary.

The code will look something like this:

def calc_repeating_digits(number):
    counts = {}
    digits = str(number)   # works whether an int or a string is passed in
    for i in range(len(digits) - 2):
        current_three_digits = digits[i:i+3]
        if current_three_digits in counts:
            counts[current_three_digits] += 1
        else:
            counts[current_three_digits] = 1
    return counts

You can then filter down to the keys which have a value greater than 1.

Abhishek Arora
2

As mentioned in another answer, you cannot do this algorithm in constant time, because you must look at at least n digits. Linear time is the fastest you can get.

However, the algorithm can be done in O(1) space. You only need to store the counts of each 3 digit number, so you need an array of 1000 entries. You can then stream the number in.
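As a minimal illustration of that streaming idea (the helper name and chunk size are my assumptions, not from the answer), keep only the fixed 1000-entry table plus a 2-digit carry across chunk boundaries:

import io

def stream_counts(fileobj, chunk_size=4096):
    counts = [0] * 1000   # fixed-size state: O(1) extra space
    tail = ""             # carry the last 2 digits across chunk boundaries
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        for i in range(len(window) - 2):
            counts[int(window[i:i+3])] += 1
        tail = window[-2:]
    return counts

counts = stream_counts(io.StringIO("123412345123456"))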

My guess is that either the interviewer misspoke when they gave you the solution, or you misheard "constant time" when they said "constant space."

Cort Ammon
  • As others have pointed out, the histogram approach is `O(10**d)` extra space, where `d` is the number of decimal digits you're looking for. – Peter Cordes Dec 02 '17 at 09:56
  • 1
    Dictionary approach would be O (min (10^d, n)) for n digits. For example if you have n = 10^9 digits and want to find the rare 15 digit sequences that occur more than once. – gnasher729 Dec 02 '17 at 14:57
1

Here's my answer:

from timeit import timeit
from collections import Counter
import types
import random

def setup_data(n):
    digits = "0123456789"
    return dict(text = ''.join(random.choice(digits) for i in range(n)))


def f_counter(text):
    c = Counter()
    for i in range(len(text)-2):
        ss = text[i:i+3]
        c.update([ss])
    return (i for i in c.items() if i[1] > 1)

def f_dict(text):
    d = {}
    for i in range(len(text)-2):
        ss = text[i:i+3]
        if ss not in d:
            d[ss] = 0
        d[ss] += 1
    return ((i, d[i]) for i in d if d[i] > 1)

def f_array(text):
    a = [[[0 for _ in range(10)] for _ in range(10)] for _ in range(10)]
    for n in range(len(text)-2):
        i, j, k = (int(ss) for ss in text[n:n+3])
        a[i][j][k] += 1
    for i, b in enumerate(a):
        for j, c in enumerate(b):
            for k, d in enumerate(c):
                if d > 1: yield (f'{i}{j}{k}', d)


for n in (1E1, 1E3, 1E6):
    n = int(n)
    data = setup_data(n)
    print(f'n = {n}')
    results = {}
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        print("{:16s}{:16.8f} ms".format(name[2:], timeit(
            'results[name] = f(**data)', globals={'f':func, 'data':data, 'results':results, 'name':name}, number=10)*100))
    for r in results:
        print('{:10}: {}'.format(r, sorted(list(results[r]))[:5]))

The array lookup method is very fast (even faster than @paul-panzer's numpy method!). Of course, it cheats, since it isn't technically finished after it completes, because it's returning a generator. It also doesn't have to check every iteration if the value already exists, which is likely to help a lot.

n = 10
counter               0.10595780 ms
dict                  0.01070654 ms
array                 0.00135370 ms
f_counter : []
f_dict    : []
f_array   : []
n = 1000
counter               2.89462101 ms
dict                  0.40434612 ms
array                 0.00073838 ms
f_counter : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_dict    : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_array   : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
n = 1000000
counter            2849.00500992 ms
dict                438.44007806 ms
array                 0.00135370 ms
f_counter : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_dict    : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_array   : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
Turksarama
  • 1
    So what are you comparing exactly? Shouldn't you return lists instead of unused generators? – Eric Duminil Dec 01 '17 at 13:20
  • `Counters` aren't used that way. Used properly, they become the fastest option with your example. If you use `timeit` with a list insted of a generator, your method becomes slower than `Counter` or `dict`. See [here](https://pastebin.com/fyFTRL5n). – Eric Duminil Dec 01 '17 at 13:39
  • Finally, your `f_array` could be faster if you first convert every char to an int : `ints = [int(c) for c in text]` and then use `i, j, k = ints[n:n+3]`. – Eric Duminil Dec 01 '17 at 13:42
1

Image as answer:

[The answer was posted as an image of code; it is not transcribed here.]

Looks like a sliding window.
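For reference, a minimal sketch of what a sliding-window version could look like in Python (my reconstruction, not a transcription of the image; it drops the hundreds digit and appends the new one, as in the C answer above):

def sliding_window_counts(digits):
    # Assumes at least 3 digits of input.
    counts = [0] * 1000
    val = int(digits[:2])                  # seed with the first two digits
    for ch in digits[2:]:
        val = (val % 100) * 10 + int(ch)   # drop hundreds, append new digit
        counts[val] += 1
    return counts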

clemens
1

Here is my solution:

from collections import defaultdict
string = "103264685134845354863"
d = defaultdict(int)
for elt in range(len(string)-2):
    d[string[elt:elt+3]] += 1
d = {key: count for key, count in d.items() if count > 1}

With a bit of creativity in the for loop (and an additional lookup list with True/False/None, for example) you should be able to get rid of the last line, as you only want to keep keys in the dict that we have already visited at least once up to that point; see the sketch below. Hope it helps :)
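A minimal sketch of that suggestion (my illustration, not the answerer's code): maintain the filtered dict inside the loop, so the final comprehension is unnecessary.

from collections import defaultdict

string = "103264685134845354863"
counts = defaultdict(int)
repeats = {}                      # built as we go; no final filtering pass
for elt in range(len(string) - 2):
    key = string[elt:elt+3]
    counts[key] += 1
    if counts[key] > 1:
        repeats[key] = counts[key]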

econ
  • See [pho7's answer](https://stackoverflow.com/a/47581953/3789665). And comments. Try to figure out *why* it doesn't get loads of votes. – greybeard Dec 27 '17 at 01:07
0

- Telling from the perspective of C.
- You can have an int 3-D array: results[10][10][10].
- Go from the 0th location to the (n-4)th location, n being the size of the string array.
- At each location, check the current, next, and next's-next characters.
- Increment the counter as results[current][next][next's next]++.
- Print the values of

results[1][2][3]
results[2][3][4]
results[3][4][5]
results[4][5][6]
results[5][6][7]
results[6][7][8]
results[7][8][9]

- It is O(n) time; there are no comparisons involved.
- You can run some parallel stuff here by partitioning the array and calculating the matches around the partitions.

Suresh
-1
inputStr = '123456123138276237284287434628736482376487234682734682736487263482736487236482634'

count = {}
for i in range(len(inputStr) - 2):
    subNum = int(inputStr[i:i+3])
    if subNum not in count:
        count[subNum] = 1
    else:
        count[subNum] += 1

print(count)
  • Thanks for your answer, but it is too similar to the algorithm given by @abhishek arora 5-6 days ago. Also, the original question was not asking for the algorithm, but rather a different question (which has already been answered multiple times) – its.david Dec 05 '17 at 14:57