1

I'm working with twitter ids which are strings because they are so huge.

Twitter's api has a "Since_id" and I want to search tweets since the earliest tweet in a list.

For example:

tweet_ids = [u'1003659997241401843', u'1003659997241401234234', u'100365999724140136236'] # etc
since_id = min(tweet_ids)

So far min(tweet_ids) works but I want to understand why it works because I want to know if it is just by chance that it worked on the few samples I gave it, or if it is guaranteed to always work.

Edit: To clarify I need to get the lowest tweet id. How do I get the lowest tweet id if they are strings that are > 2^32-1 and therefore can't be represented as integers in python 2.7 on a 32 bit machine.

I am using python 2.7 if that matters

Terence Chow
  • 10,755
  • 24
  • 78
  • 141
  • 3
    If all strings are guaranteed to be the same length, it _might_ be slightly faster to do `min(tweet_ids)` instead of `min(tweet_ids, key=int)`. But the speed gain is unlikely to matter, and you'd need some kind of comment explaining why it's safe to compare them as strings or your code will look suspicious to any reader (including you in a few months), so I'd go with the conversion to `int`. – abarnert Jun 04 '18 at 17:50
  • 1
    *I'm working with twitter ids which are strings because they are so huge* This is a non-problem in Python, whose ints are not limited to machine words. Represent the ids as ints, and your problems will go away. – user4815162342 Jun 04 '18 at 17:53
  • 2
    In Python 2.7, you might want to use `key=long` instead of `key=int`. Python 2.x had two different integer types; `int` is as big as a C `long` (usually 32 bits), while `long` is a list of digits that can go as long as you have memory for. In 2.7, `int` values are automatically promoted to `long` when necessary, but if you care more about clarity than 3.x portability, I think it makes sense to be explicit here. – abarnert Jun 04 '18 at 17:53
  • 1
    Meanwhile, the reason the Twitter API sends these numbers as strings is a limitation of JSON. JSON allows all numbers to be represented by a `float64`. If you stick a 70-bit integer in a `float64`, you lose the rightmost few digits, which is obviously not acceptable. So they pass them as strings, which you can convert to `long` in Python 2.x, `int` in Python 3.x, `BigInt` in Scala, or whatever type your language offers that JSON doesn't. – abarnert Jun 04 '18 at 17:56
  • @abarnert I don't think this question is a duplicate, since it involves finding better alternatives to string comparison. I provided an alternative approach to his question which involves conversion to an int and back. – Ṃųỻịgǻňạcểơửṩ Jun 04 '18 at 18:02
  • 1
    It's a little silly that XY questions like this one get marked as duplicates. Even before the edits, you could see that the problem of the OP was not how strings are compared.. Anyways, abernert's comments above are the (best IMHO) answer to the question. – Joooeey Jun 04 '18 at 18:03
  • 1
    There are definitely other questions on SO about ints being compared lexicographically and what to do about it, so rather than un-dup this, we should just find the best one and retarget it. – abarnert Jun 04 '18 at 18:04
  • 1
    I suppose the ideal duplicate question would be one that covered the use of `key` as well as converting all of the numbers. Which means there may not be a single ideal dup, but we can always add two (or more) targets. – abarnert Jun 04 '18 at 18:07
  • There are multiple questions on natural sorting, like [this one](https://stackoverflow.com/questions/6062973/compare-two-python-strings-that-contain-numbers), but that's overkill here (as well as not obviously related), so it's definitely not the right dup; just something to get on the Linked Questions list… – abarnert Jun 04 '18 at 18:13
  • There's still a side question on `int` vs. `long` in Python 2.x, and when they get auto-converted in later 2.x, and so on. If you're still confused by that, there's probably an existing question, but if you can't find one, ask and… someone else can find the duplicate. – abarnert Jun 04 '18 at 18:23

2 Answers2

0

Python will compare these strings exactly as it compares any other strings; that is, it will compare them lexicographically.

Thus, it will put 12 before 2, which may be undesirable for you.

Here's a function that will compute the numerical minimum of strings representing integers for you.

# A is an iterable of strings representing integers.
def numerical_min(A):
    cur_min = A[0]
    for x in A[1:]:
        if len(x) < len(cur_min):
            cur_min = x
            continue
        if len(x) > len(cur_min):
            continue
        for m,n in zip(x, cur_min):
            if int(m) < int(n):
                cur_min = x
                break
    return cur_min
merlin2011
  • 71,677
  • 44
  • 195
  • 329
0

From the Python Documentation, it implies that all Strings, including your case where the strings are large sequences of digits, are compared lexicographically.

  • The "lesser integer" string 2 is less than then "greater integer" string 100 in this case.
  • Negative integers sorted lexicographically are "greater" than positive integers. "-1" is greater than "99" when compared this way because the minus hyphen is lexicographically greater than all digits.
  • Equal integers "2" and "02" aren't necessarily equal in terms of string comparison. "02" is less than "2" string-wise because of the leading zero.

It is better to convert the str into a long int, and then compare it. As in

  • tweet_ids = [long('1003659997241401843'), long('1003659997241401234234'), long('100365999724140136236')]
  • since_id = min(tweet_ids)

Since JSON does not allow 70-bit long ints, convert the smallest int back into a str. Replace the since_id line with

  • since_id = min(tweet_ids, key=int)
  • 1
    I thought twitter uses strings because the numbers are so large they can't be represented as ints? – Terence Chow Jun 04 '18 at 17:49
  • They can be represented as ints. Large ints and longs with arbitrary precision. Sort them as ints. – Ṃųỻịgǻňạcểơửṩ Jun 04 '18 at 17:49
  • I disagree. I'm using python 2.7 and on google app engine. My guess is app engine is still 32 bit so that would mean my largest int would 4.3 billion. These tweet ids are much larger than 4 billion – Terence Chow Jun 04 '18 at 17:52
  • compare for how Python handles int: https://stackoverflow.com/questions/7604966/maximum-and-minimum-values-for-ints – Joooeey Jun 04 '18 at 17:52
  • @TerenceChow They're too large to be represented as integers _in JSON_. That's because JSON explicitly documents that numbers are to be treated as `float64` values, and if you store a 70-bit integer in a `float64` value, you lose the last few digits. They're not too large to be represented as integers _in Python_, because Python represents integers as a list of (30-bit) digits, so you can hold as big a number as you want (until you run out of memory). – abarnert Jun 04 '18 at 17:53
  • @abamert Sorry about that. Is there any way to improve my answer? One suggestion is to convert the least int *back* into a string. Is it recommended that this question is to be de-duped since there is a *better* alternative to just comparing strings? – Ṃųỻịgǻňạcểơửṩ Jun 04 '18 at 17:53
  • @abarnert ahhh didn't know that! thank youu – Terence Chow Jun 04 '18 at 17:59
  • @TerenceChow Do you want to consider converting the smallest int to a str? – Ṃųỻịgǻňạcểơửṩ Jun 04 '18 at 18:00
  • @Mulliganaceous I don't think the question should be deduped. But meanwhile, your answer could be improved in a few ways. First, if the tweet IDs are going to be used as strings, just use `min(tweet_ids, key=int)` instead of converting them all and then converting back. Second, as I explained in a comment, if this code isn't meant to be portable to 3.x, I'd use `long` explicitly rather than `int`. Also, it's 70 bits, not digits. – abarnert Jun 04 '18 at 18:03
  • @abarnert Done. Thanks for suggesting me a better answer. Since there may be a subtle difference in comparing generic strings and how the asker intended to compare these int-strings, is it better to suggest this question to leave it as a dupe, to retarget the dupe? – Ṃųỻịgǻňạcểơửṩ Jun 04 '18 at 18:10
  • @Mulliganaceous As I said in a comment on the question, I think it's better to retarget the dup… as soon as someone can find a good duplicate. There's nothing relevant in https://sopython.com/canon and searching is always a pain, but I'm looking… – abarnert Jun 04 '18 at 18:12