0

Generating a List of random strings, then using a for/in loop and also a List Comprehension expresison to fund the longest string and the length of that string.

Both techniques compute the max length correctly, but sometimes the for/in loop finds the same longest word as the List Comprehension, sometimes not. Why? What's the logic error?

import random
import string
def cobble_large_dataset(dataset_number_of_elements):
    '''
    Build a list of Lists, each List is a String of a random sequence of 1-10 characters
    '''
    myList = []         # Empty List   
    for i in range(0,dataset_number_of_elements):
        string_length = random.randint(1, 10)
        tmp = ''.join(random.choices(string.ascii_uppercase + string.digits, k=string_length))  # https://stackoverflow.com/questions/2257441/random-string-generation-with-upper-case-letters-and-digits
        tmp = [tmp]
        #print(tmp)
        myList.extend([tmp])
    return myList    

def list_comprehension_test(wordsList):
    '''
    Process a List of Lists using List Comprehension. 
    Each List in the List of Lists is a single String
    '''
    start_time = time.time()
   
    maximumWordLength, longest_word = max([(len(x[0]), x[0]) for x in wordsList]) # This works because x is a List of strings
    return ((time.time() - start_time), longest_word, maximumWordLength)

def brute_force_test(wordsList):
    '''
    Process a List of Lists using a brute-force for/in loop. 
    Each List in the List of Lists is a single String    
    '''
    start_time = time.time()
    maximumWordLength = 0
    for word in wordsList:
        tmp = word[0]
        #print(tmp)
        if (len(tmp) >= maximumWordLength):
            maximumWordLength = len(tmp)
            longest_word = tmp
            #print(tmp)
            #print(longest_word + " : " + str(maximumWordLength))
    return ((time.time() - start_time), longest_word, maximumWordLength)

import time
start_time = time.time()
dataset = cobble_large_dataset(100)
print (str(len(dataset)) + ' Strings generated in ' + str((time.time() - start_time)) + ' seconds.')

# Let's see if both techniques produce the same results:
result_brute_force = brute_force_test(dataset)
print('Results from Brute Force = ' + result_brute_force[1] + ', ' + str(result_brute_force[2]) + ' characters' )
result_list_comprehension = list_comprehension_test(dataset)
print('Results from List Comprehension = ' + result_list_comprehension[1] + ', ' + str(result_list_comprehension[2]) + ' characters' )
if (result_list_comprehension[1] == result_brute_force[1]):
    print("Techniques produced the same results.")
else:
    print("Techniques DID NOT PRODUCE the same results
nicomp
  • 4,344
  • 4
  • 27
  • 60
  • fyi, `myList.append(tmp)` is the same as `myList.extend([tmp])` and is more natural and clear. You do just want to append `tmp` to the list, right? – CryptoFool Mar 21 '21 at 14:38

2 Answers2

1

You're passing a list of tuples to max, with no key function, so max is comparing tuples, not just lengths. When the lengths tie, tuple comparison moves on to comparing the second elements, the strings themselves, so the max in case of a length tie is the string that compares greatest (by lexicographic code point comparison).

In contrast, in case of a length tie, your loop picks whichever candidate appears last. (If you had used > instead of >= in if (len(tmp) >= maximumWordLength):, it would pick the first candidate.)

(Also, that's some weird stuff you're doing with tmp. The 1-element lists you're building are pointless - cobble_large_dataset should just return a flat list of strings.)

user2357112
  • 260,549
  • 28
  • 431
  • 505
  • 1
    I think `max` looks at the second value in the case of ties. So it doesn't pick the first, but the "biggest" by the second value, i.e. biggest lexicographically. – Yevhen Kuzmovych Mar 21 '21 at 14:36
  • I already tried changing the comparison to > rather than >= and the inconsistent results don't change. – nicomp Mar 21 '21 at 14:36
  • @YevhenKuzmovych: Ah, right, forgot this code is doing straight tuple comparison instead of using a `key` function. Been up too long. Fixed now. – user2357112 Mar 21 '21 at 14:37
1

In your list comprehension case, you want to tell max to just operate on the first item in each of the pairs of values in the list. This is the equivalent of what the for-loop case is doing, since it only considers the length of each string. So you want:

maximumWordLength, longest_word = max(
    [(len(x[0]), x[0]) for x in wordsList],
    key = lambda x: x[0])  # This works because x is a List of strings

As others have already pointed out, you also want to change the >= comparison in the brute-force case to >. If you make these two changes, you will get the same result from your two methods.

CryptoFool
  • 21,719
  • 5
  • 26
  • 44
  • Thanks. I understand now that max() considers the arguments in left-to-right order in order to break ties. Awesome! – nicomp Mar 21 '21 at 15:03