
I started working through the CSES problem set, and I am kind of surprised by this problem: https://cses.fi/problemset/task/1621/ It's pretty straightforward: given n numbers as a string input (like "2 3 3 2 3"), find the count of distinct numbers (for that example the answer is 2).

I figured the fastest way to do it would be to use a hash-based set. So I split() the input string and used the map() function to convert each of the string numbers to an int.

Here's the code:

n = int(input())
nums = map(int, input().split())
nums = set(nums)
print(len(nums))

But this code gets a TLE (Time Limit Exceeded) verdict for an input of 100,000 numbers.

So, as a shot in the dark, I changed the implementation a little bit and did not map the strings into integers. Instead, I just created a set() out of the strings. Here's the code:

n = int(input())
 
nums = input().split()
nums = set(nums)
 
print(len(nums))

And this version passes for the same input of 100,000 numbers.
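As an aside, one variant I have not benchmarked on CSES would keep the int conversion but read the whole input in one go via sys.stdin, to rule out input-handling overhead (treat this as a sketch, not a verified fix):

import sys

def main():
    data = sys.stdin.buffer.read().split()  # read all tokens at once, as bytes
    n = int(data[0])                        # first token is n
    nums = set(map(int, data[1:1 + n]))     # convert the next n tokens and dedupe
    print(len(nums))

main()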

I was under the assumption that integers are easier to hash than strings, so it seems that the map() step is the culprit. What's the time complexity of the map() function?
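To make the comparison concrete, here is a rough micro-benchmark I would use (the timings are machine-dependent, and the random 9-digit values are just my assumption about what the CSES input looks like):

import random
import timeit

tokens = [str(random.randint(1, 10**9)) for _ in range(100_000)]

t_str = timeit.timeit(lambda: set(tokens), number=10)            # set of strings
t_int = timeit.timeit(lambda: set(map(int, tokens)), number=10)  # parse to int, then set

print(f"set of strings: {t_str:.3f} s for 10 runs")
print(f"set of ints:    {t_int:.3f} s for 10 runs")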

The accepted answer to the question "What is the complexity of str() function in Python3?" suggests that the time complexity for non-base-2 number strings is O(N**2), where N is the number of digits in the string passed to int().

So, is the time complexity of the map(int, str) call O(N**2), where N is the length of the base-10 number string?
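One way to probe that would be to time int() directly as the number of digits grows; the values in this problem have at most about 10 digits, so any quadratic behaviour would only matter for much longer strings. Again, just a sketch (Python 3.11+ also caps int/str conversion length by default, which the snippet lifts):

import sys
import timeit

# Python 3.11+ limits int <-> str conversions to 4300 digits by default; lift it for the test.
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)  # 0 disables the limit

for digits in (10, 1_000, 10_000, 100_000):
    s = "9" * digits
    t = timeit.timeit(lambda: int(s), number=100)
    print(f"{digits:>7} digits: {t:.4f} s for 100 conversions")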

Abhinav S.
  • It's not O(N**2), since there are only N elements to process. Instead, the time taken to parse the ints incurs an overhead that is probably causing the TLE – cs95 May 02 '23 at 18:07
  • Python strings are hashed up front, and when the hash is needed, it is simply retrieved. Since you create the strings first, you save no time by converting to `int` – juanpa.arrivillaga May 02 '23 at 18:10
  • @cs95 oh right, of course. I updated the Question statement a little bit. I meant `N` as the length of the input string to the map(int, str) function – Abhinav S. May 02 '23 at 18:15
  • @juanpa.arrivillaga I did not know about that. Thanks, that makes a lot of sense – Abhinav S. May 02 '23 at 18:16
  • @AbhinavS. actually, I take that back. I thought I remembered that applying to Python strings, but I cannot confirm it (the source code for the Python `str` type is quite a lot to try to grok, and there are a lot of C idioms I do not understand), so it seems I am incorrect – juanpa.arrivillaga May 02 '23 at 18:24
  • @AbhinavS. ah, wait, no, it does seem to be cached, [see here](https://github.com/python/cpython/blob/587f2f018051049cf5d9de3e12ed5aa7644404dc/Objects/unicodeobject.c#L11028) but I'm not sure if it is actually computed up front. – juanpa.arrivillaga May 02 '23 at 18:29
  • @juanpa.arrivillaga I don't believe that they would be hashed up front; otherwise there wouldn't be some voodoo around what might or might not get interned? If you had the hash up front for _every_ string, then CPython would just use a lookup to see whether it needed to allocate new memory? – roganjosh May 02 '23 at 18:37
  • @roganjosh yes that is my thinking as well. Note, a lot of strings *are cached* or interned by the interpreter. But generally speaking, I think that is right. – juanpa.arrivillaga May 02 '23 at 18:38
  • @roganjosh anyway, empirically, if I define a `def time_hash(s: str) -> float:` function, then `time_hash('A')` gives `5.290000046898058e-07`, `time_hash(1_000_000*"A")` gives `0.0005778240000040569`, and `time_hash(10_000_000*"A")` gives `0.005646631999979945`. You can also verify the caching behaviour using `s = 10_000_000*"A"` and looking at how calling `time_hash(s)` over and over again behaves – juanpa.arrivillaga May 02 '23 at 18:44
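For reference, a time_hash helper along the lines juanpa.arrivillaga describes might look like this (my guess at the implementation, not necessarily theirs):

import timeit

def time_hash(s: str) -> float:
    # Time a single hash() call on s. CPython caches a str's hash on the object
    # after the first computation, so repeat calls on the same object are fast.
    return timeit.timeit(lambda: hash(s), number=1)

print(time_hash("A"))
print(time_hash(1_000_000 * "A"))

big = 10_000_000 * "A"
print(time_hash(big))  # first call computes the hash
print(time_hash(big))  # subsequent calls on the same object return the cached hash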

0 Answers