
I have a task at my university and still can't solve it. The input is N <= 1000000 (the number of strings that follow), then the strings themselves. Each string is shorter than 1000 characters.

I need to print a single number to stdout: how many of the strings are unique. How can I do that? The major restriction is that I can use only the numpy lib, with a 5 second time limit and 5 MB of RAM (!). The task also says an answer is accepted if it differs from the real answer by no more than 5%.

I tried this code:

import numpy as np


N = int(input())
a = np.array([])
for i in range(N):
    x = input()
    if not np.any(a == x):
        a = np.append(a, x)

print(len(a))

but it took 12 MB and 97 ms. Another attempt:

import numpy as np

N = int(input())
results = np.empty(N, dtype=object)
for i in range(N):
    results[i] = input()

print(len(np.unique(results)))

…it took 10 MB.

Any ideas how to achieve that? :)


UPDATED: I don't know what's going on, but I have now checked this code:

import numpy as np

N = int(input())
a = np.array([])
cc = 0
for i in range(N):
    x = input()
    cc += 1
    if cc < 500:
        if not np.any(a == x):
            a = np.append(a, x)


print(len(a))

and it showed me 81 ms and 8.7 MB. How is that possible if I only filled 500 elements of the array?


TEST 3:

This took 98 ms and 6.36 MB (almost 5!)

N = int(input())
s = set()
for i in range(N):
    x = input()
    s.add(x)

print(len(s))

TEST 4:

This took 98 ms and 5.41 MB.

import hashlib

N = int(input())
s = set()
for i in range(N):
    x = input()
    s.add(hashlib.md5(x.encode()).hexdigest())

print(len(s))

TEST 5:

5.32 MB

import hashlib

N = int(input())
s = set()
s_add = s.add
for i in range(N):
    s_add(hashlib.md5(input().encode()).hexdigest()[:-3])

print(len(s))

TEST 6:

98 ms and 5.63 MB

import hashlib
import itertools

N = int(input())
s = set()
s_add = s.add
for _ in itertools.repeat(None, N):
    s_add(str(abs(hash(input())))[:-3])

print(len(s))

TEST 7:

179 ms and 6.92 MB

import itertools

N = int(input())
s = set()
s_add = s.add
for _ in itertools.repeat(None, N):
    s_add(abs(hash(input())))

print(len(s))

TEST 8:

286 ms and 5.15 MB

N = int(input())
s = set()
s_add = s.add
for i in range(N):
    s_add(abs(hash(input())))

print(len(s))
sirjay
  • You have potentially a gigabyte of input data (if N is at its maximum and each string is of maximum length) - you're obviously not going to be able to store all the strings in 5 MB. The problem does not appear solvable unless there's some assumption you can make that you haven't told us about. If the strings are guaranteed unique in the first few characters, you could store up to a 5-character prefix of each string - I think that's `dtype='s5'` in numpy terms. Or, you could store 32-bit hashes of each string, if you can live with the possibility of hash collisions. – jasonharper Aug 06 '19 at 19:16
  • @jasonharper I agree that it seems unlikely. One small chance I see from the description and code is that maybe items are read in one at a time. In this case I would skip numpy entirely and use the built-in [set()](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset) class. Sets only allow for unique items and lookup operations are achieved in constant time. Without more detail on the input, I can't say whether this would make it solvable or not. – G. Anderson Aug 06 '19 at 19:19
  • @G.Anderson only issue there is the problem specifies using numpy. Otherwise yeah, just doing `len(set(L))` is my immediate thought. – Andy Aug 06 '19 at 19:20
  • @Andy fair point, but generally when such a restriction is applied it's "You can only use the standard library and numpy" because otherwise they would have to reinvent the wheel w.r.t. input and for-loops anyway. `set()` doesn't require any additional imports, so this is really a good clarification for the OP: Are you required to do all the work _only in numpy_, or are you just not allowed to use _additional imported libs beyond numpy_? – G. Anderson Aug 06 '19 at 19:25
  • What happens if you replace the `np.any` check with an `np.isin` in your first example? – Andy Aug 06 '19 at 19:34
  • @BradSolomon almost there, please check my updates. Can you suggest any improvements? It's almost 5 MB – sirjay Aug 06 '19 at 19:49
  • @Andy Same results. But set() is almost 5 MB, please check the updates – sirjay Aug 06 '19 at 19:49
  • @G.Anderson almost 5 MB, any other suggestions? – sirjay Aug 06 '19 at 19:50
  • What if you add a substring check, and only store the longest occurrence as a dict key with a value of the number of substrings? I'm skeptical that it would run in 5 seconds, but it might reduce the memory footprint. – Andy Aug 06 '19 at 20:17
  • You'd have to short-circuit at the first superstring found, or you run the risk of double or triple counting simple strings. – Andy Aug 06 '19 at 20:18
  • Wait, according to new details in the question, Test #8 was a success: You need to be within 5% of 5 seconds and 5 MB RAM, and test #8 is within 4%. – Andy Aug 06 '19 at 20:21
  • Honestly, my suggestion is to show your results to whomever gave you this task and ask what you're missing. Both numpy and set trade additional memory overhead for better speed, while a simple list gives better memory performance but loses lookup speed for duplicate checking, and hashing libraries are outside of the numpy requirement – G. Anderson Aug 06 '19 at 20:24
  • @Andy, oh sorry for the misunderstanding: 5% means that if the correct answer is, for example, 1000 unique strings, then 1050 will be accepted (RAM must always be < 5 MB) – sirjay Aug 06 '19 at 20:25
  • Oh, darn... Then how about that dict/substring count approach? Any luck there? – Andy Aug 06 '19 at 20:26
  • Can we cheat? `$ sort – Rick James Aug 14 '19 at 05:36
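
Picking up jasonharper's suggestion above of storing fixed-width hashes rather than the strings themselves: below is a minimal, untested sketch that uses only numpy plus the built-in hash(), folding each string's hash down to 32 bits and storing the results in a uint32 array. The array itself is about 4 MB for N = 1,000,000, but note that np.unique sorts a copy, so peak memory is roughly double that; whether this would pass the grader's 5 MB limit is an open question.

import numpy as np

N = int(input())
hashes = np.empty(N, dtype=np.uint32)  # 4 bytes per string: ~4 MB for N = 1,000,000
for i in range(N):
    # fold the built-in string hash down to 32 bits; the rare collisions
    # only undercount slightly, which the 5% tolerance should absorb
    hashes[i] = hash(input()) & 0xFFFFFFFF

print(len(np.unique(hashes)))

Sorting the array in place with hashes.sort() and counting the positions where consecutive values differ would avoid the extra copy that np.unique makes.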

2 Answers


A couple of speedups:

  1. Use itertools.repeat(); it avoids manufacturing a distinct integer object on each iteration.
  2. Bind the name `s_add` to `s.add` to avoid a repeated attribute lookup inside the tight loop: `s = set(); s_add = s.add`, then call `s_add(x)` within the loop (a combined sketch follows the footnote).*

*If memory serves me, a recent version of Python 3 did a much better job of minimizing the difference in attribute lookups. This one is probably of dubious incremental benefit, but give it a try.
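
Put together, a minimal sketch of both speedups applied to the plain set-based loop (essentially what TEST 6 in the question already does, minus the hashing) might look like this:

import itertools

N = int(input())
s = set()
s_add = s.add                         # bind the method once, outside the loop
for _ in itertools.repeat(None, N):   # repeat() hands back the same None object each time
    s_add(input())

print(len(s))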

Regarding hashing: two common ones to choose from would be MD5 and SHA-1, which produce hashed bytes objects of 16 and 20 bytes, respectively. On my laptop at least, SHA-1 comes out faster, but not by a ton:

>>> import hashlib
>>> import string
>>> import random
>>> b = bytes(
...     "".join(random.choices(string.ascii_letters, k=999)),
...     'utf-8'
... )
>>> %timeit hashlib.md5(b).digest()
2.48 µs ± 8.62 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit hashlib.sha1(b).digest()
1.96 µs ± 6.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

But you may not need something that heavyweight. (You're not interested in the hash for its cryptographic security; you're interested in it for the sake of saving space in the accumulating set.) There's also the built-in hash(), and though it isn't guaranteed to be the same for the same input across Python sessions, that doesn't seem to be a requirement here because you're processing the strings within a single process:

>>> %timeit hash(b)
84.9 ns ± 0.193 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

As a disclaimer, I'm not too familiar (okay, not familiar at all) with the implementation of hash() and the entropy it produces. [Some reading here.] I would dare say the chance of collision is mathematically higher than with MD5 or SHA-1, though probably still quite, quite low.
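
As a rough back-of-the-envelope check (assuming hash() spreads strings roughly uniformly over 64 bits, as on a 64-bit CPython build), the birthday bound puts the expected number of collisions among a million distinct strings at a negligible level:

# birthday-bound estimate: expected collisions ≈ n * (n - 1) / (2 * m)
n = 1_000_000        # number of strings
m = 2 ** 64          # size of the hash space on a 64-bit CPython build
print(n * (n - 1) / (2 * m))   # ≈ 2.7e-08, far below the 5% error budget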

Brad Solomon

I've been following along with great interest and would suggest (with all credit to @Brad Solomon):

import hashlib
import itertools

N = int(input())
s = set()
s_add = s.add
for _ in itertools.repeat(None, N):
    s_add(hashlib.md5(input().encode()).hexdigest()[:-3])

print(len(s))

(Though this doesn't really use only the numpy lib.)

EDIT: so, something like this:

N = int(input())
s = set()
s_add = s.add
for i in range(N):
    s_add(input()[:-300])

print(len(s))
vahvero
  • I have updated: `itertools.repeat(None, N)` took more RAM :( – sirjay Aug 06 '19 at 20:13
  • What if you reduced the bits in your current hash and changed `for i in range(N)` to `for _ in range(N)`? It only needs a 0.15 MB reduction now! – vahvero Aug 06 '19 at 20:58
  • I believe the key to success is to accept the 5% error allowance and only store a minimal amount of data in the set(). So if we take **TEST 3** and change it to `s_add(input()[:-50])`, we might drop RAM usage further. Or take **TEST 8** and store only 95% of the hash values. – vahvero Aug 06 '19 at 22:05
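
For what it's worth, a hedged sketch along those lines: keep only the first 8 hex characters (32 bits) of each MD5 digest so every set entry stays short. With up to a million strings, the expected number of 32-bit collisions is on the order of a hundred, comfortably inside the 5% tolerance; whether the per-entry overhead of a Python set then fits in 5 MB is untested.

import hashlib

N = int(input())
s = set()
s_add = s.add
for _ in range(N):
    # store only the first 8 hex chars (32 bits) of the digest; the rare
    # collisions undercount slightly, within the 5% error allowance
    s_add(hashlib.md5(input().encode()).hexdigest()[:8])

print(len(s))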