59

I am looking for a function to make a list as unsorted as possible, preferably in Python.

Backstory:

I want to check URL statuses and see whether the URLs give a 404 or not. I just use the asyncio and requests modules. Nothing fancy.

Now I don't want to overload servers, so I want to minimize checking URLs which are on the same domain at the same time. I have this idea to sort the URLs in such a way that items which are close to one another (having the same sort key = domain name) are placed as far apart from each other in the list as possible.

An example with numbers:

a=[1,1,2,3,3]  # <== sorted list, sortness score = 2
   0,1,2,3,4   # <== positions

could be unsorted as:

b=[1,3,2,1,3]  # <== unsorted list, sortness score = 6
   0,1,2,3,4   # <== positions

I would say that we can compute a sortness score by summing up the distances between equal items (which have the same key = domain name). A higher sortness score means the list is better unsorted. Maybe there is a better way of measuring unsortedness.

The sortness score for list a is 2. The sum of distances for 1 is (1-0)=1, for 2 is 0, for 3 is (4-3)=1.

The sortness score for list b is 6. The sum of distances for 1 is (3-0)=3, for 2 is 0, for 3 is (4-1)=3.
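
For reference, a small helper that computes this score exactly as defined above (a plain sum of distances between consecutive equal items; the prototype further down additionally takes the square root of each gap):

def sortness(lst: list) -> int:
    last_seen = {}
    score = 0
    for i, key in enumerate(lst):
        if key in last_seen:
            score += i - last_seen[key]  # distance to the previous equal item
        last_seen[key] = i
    return score

# sortness([1, 1, 2, 3, 3]) == 2
# sortness([1, 3, 2, 1, 3]) == 6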

The URL list would look something like a list of (domain, URL) tuples:

[
   ('example.com', 'http://example.com/404'),
   ('test.com', 'http://test.com/404'),
   ('test.com', 'http://test.com/405'),
   ('example.com', 'http://example.com/405'),
   ...
]

I am working on a prototype which works OK-ish, but it is not optimal, as I can find some variants by hand which are better unsorted.

Anyone want to give it a go?

This is my code, but it's not great :):

from collections import Counter
from collections import defaultdict
import math


def test_unsortness(lst:list) -> float:
    pos = defaultdict(list)
    score = 0
    # Store positions for each key
    # input = [1,3,2,3,1] => {1: [0, 4], 3: [1, 3], 2: [2]}
    for c,l in enumerate(lst):
        pos[l].append(c)
    for k,poslst in pos.items():
        for i in range(len(poslst)-1):
            score += math.sqrt(poslst[i+1] - poslst[i])
    return score


def unsort(lst:list) -> list:
    free_positions = list(range(0,len(lst)))
    output_list = [None] * len(free_positions)
    for val, count in Counter(lst).most_common():
        pos = 0
        step = len(free_positions) / count
        for i in range(count):
            output_list[free_positions[int(pos)]] = val
            free_positions[int(pos)] = None  # Remove position later
            pos = pos + step
    free_positions = [p for p in free_positions if p is not None]
    return output_list


lsts = list()
lsts.append( [1,1,2,3,3] )
lsts.append( [1,3,2,3,1] )       # This gets a worse score after unsort()
lsts.append( [1,2,3,0,1,2,3] )   # This gets a worse score after unsort()
lsts.append( [3,2,1,0,1,2,3] )   # This gets a worse score after unsort()
lsts.append( [3,2,1,3,1,2,3] )   # This gets a worse score after unsort()
lsts.append( [1,2,3,4,5] )

for lst in lsts:
    ulst = unsort(lst)
    print( ( lst, '%.2f'%test_unsortness(lst), '====>', ulst, '%.2f'%test_unsortness(ulst), ) )

#  Original               score             Unsorted               score
#  -------                -----             --------               -----
# ([1, 1, 2, 3, 3],       '2.00',  '====>', [1, 3, 1, 3, 2],       '2.83')
# ([1, 3, 2, 3, 1],       '3.41',  '====>', [1, 3, 1, 3, 2],       '2.83')
# ([1, 2, 3, 0, 1, 2, 3], '6.00',  '====>', [1, 2, 3, 1, 2, 3, 0], '5.20')
# ([3, 2, 1, 0, 1, 2, 3], '5.86',  '====>', [3, 2, 1, 3, 2, 1, 0], '5.20')
# ([3, 2, 1, 3, 1, 2, 3], '6.88',  '====>', [3, 2, 3, 1, 3, 2, 1], '6.56')
# ([1, 2, 3, 4, 5],       '0.00',  '====>', [1, 2, 3, 4, 5],       '0.00')

P.S. I am not looking for just a randomize function, and I know there are crawlers which can manage domain loads, but this is for the sake of the exercise.

ddofborg
  • [`random.shuffle`](https://docs.python.org/3/library/random.html#random.shuffle)? – Matiiss Dec 28 '21 at 14:33
  • Surprisingly, there's a question that seems to be exactly on this topic https://cs.stackexchange.com/questions/11836/how-to-measure-sortedness – shriakhilc Dec 28 '21 at 15:03
  • I'm not sure how good this idea is but what if you manually try to do: `[c,a,a,a,b,b]` -> `count c:1, a:3, b:2` -> `sort` -> new array that will be filled from the two sides: `[a,_,_,a,_,a]` -> `[a,b,_,a,b,a]` -> `[a,b,c,a,b,a]`. If I got it correctly the main question is in the distance and "score". – IgorZ Dec 28 '21 at 15:16
  • The scores for your two example lists `a` and `b` seem to ignore the `sqrt` that `test_unsortness` is doing (which is quite important, otherwise positions `0,2,4,6,8` is no better than `0,3,4,5,8` although it is more "evenly" spread) – Bergi Dec 28 '21 at 23:30
  • See also [Randomising a list whilst ensuring duplicates are not sequential](https://stackoverflow.com/q/29797136/1048572), [Disperse Duplicates in an Array](https://stackoverflow.com/q/17899312/1048572), and [An algorithm that spreads similar items across a list](https://softwareengineering.stackexchange.com/q/262557/66652) – Bergi Dec 28 '21 at 23:47
  • How about you just stick the URLs in a dictionary (a multimap, a list of lists, using the domain name as key), and just walk the keys linearly, removing items along the way? – Filip Milovanović Dec 29 '21 at 00:41
  • I agree with @FilipMilovanović, a round robin approach seems the simplest – Kartik Soneji Dec 29 '21 at 08:44
  • @FilipMilovanović Exactly. And since the goal is to not overload the servers, you can keep a dict with domain as key, and last download time as value. Before downloading again from the server, you can make sure to wait until last_download_time + DESIRED_PAUSE. – Eric Duminil Dec 29 '21 at 08:59
  • I can imagine other ways to solve this problem for domains. Even with consumer and producer queue to always give the next least recently checked domain.... but this is more of a question/idea I had and though it sounds simple, I was not able to create a solution which worked better than pen and paper (for small lists). – ddofborg Dec 29 '21 at 12:44
  • Does your "unsortedness" score represent your true objective, or is it simply convenient? When actually querying domains, is a pattern of `[3, 2, 1, 0, 1, 2, 3]` truly better than `[1, 2, 3, 0, 1, 2, 3]` ? If the goal is "spread things out", having one separation of 100 and fifty of 1 is probably worse than having fifty-one of ~3 each, despite both being 150 with your in-text metric. (Granted, you can find better for that setup, but looking at extremes can help gauge if the metric is misspecified.) -- P.S. "Sort" is a bit of a misnomer here, as you don't really care about total ordering. – R.M. Dec 29 '21 at 15:21
  • @Matiiss: Even better, `random.shuffle` 1000 times, keeping the one with the highest "sortness" score. – James Dec 29 '21 at 21:22
  • *unsorted* seems to be a misnomer here, you actually want the list *sorted*, just to an uncommon specification (similar items having maximized distances). At least I've always seen sorted defined (in terms of computing) as ~"the arrangement of data in a specified sequence." To be, probably over, pedantic... But great question overall, no matter what you think of my comment :) – TCooper Dec 29 '21 at 23:59
  • The code doesn't match the text of the question. Is the score dist^0.5? dist^1? dist^2? Other? – TLW Dec 30 '21 at 03:22
  • Yeah, sorry. I added `math.sqrt` later. – ddofborg Dec 30 '21 at 11:05
  • Different domain names are a poor indicator for being on different servers. If you really want to avoid overlading servers, use the IP addresses instead (which is also not perfect, but probably as good as you can do it). – Henrik supports the community Dec 30 '21 at 19:00
  • Exploration of this question at or.stackechange.com: [A stackoverflow user's curious problem of maximising unsortedness](https://or.stackexchange.com/questions/7501/a-stack-overflow-users-curious-problem-of-maximising-unsortedness) – Stef Jan 04 '22 at 14:32
  • @TCooper perhaps "anti-sorted" would be applicable (by analogy to anti-phase, anti-parallel, etc.) After all a random order could be obtained by assigning a random key to each item then sorting those keys - an old Excel trick - but despite the sort used there we wouldn't call the result sorted – Chris H Jan 04 '22 at 14:54
  • This should be as easy as finding the frequency of each individual element and distributing them evenly over the indices of a target array. – Redu Mar 24 '22 at 19:33

9 Answers

32

Instead of unsorting your list of URLs, why not group them by domain, each in its own queue, and then process them asynchronously with a (randomised?) delay in between?

It looks less complex to me than what you're trying to do to achieve the same thing, and if you have a lot of domains, you can always throttle the number of queues running concurrently at that point.
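
A minimal sketch of that idea, driving the blocking requests module from asyncio's default thread pool; the DELAY constant and function names here are illustrative, not part of the answer:

import asyncio
from collections import defaultdict

import requests

DELAY = 1.0  # assumed pause between two requests to the same domain

async def check_domain(domain, urls, results):
    # one domain's queue is processed sequentially, with a pause in between
    loop = asyncio.get_running_loop()
    for url in urls:
        # requests is blocking, so run it in the default executor
        resp = await loop.run_in_executor(None, requests.head, url)
        results[url] = resp.status_code
        await asyncio.sleep(DELAY)

async def check_all(pairs):
    # pairs is a list of (domain, URL) tuples, as in the question
    queues = defaultdict(list)
    for domain, url in pairs:
        queues[domain].append(url)
    results = {}
    # domains run concurrently; URLs within one domain do not
    await asyncio.gather(*(check_domain(d, u, results) for d, u in queues.items()))
    return results

# results = asyncio.run(check_all(data))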

gfache
  • You make a really good point. But it's more of a comment than an answer. – Isaac Rabinovitch Dec 31 '21 at 12:03
  • I'd say OP is asking on how to not overload servers he wants to send requests to in the end so my answer is just, don't do what you want to do because it seems over complex for the need and here is an alternative. Sounds like a valid answer to me? – gfache Dec 31 '21 at 12:18
  • @IsaacRabinovitch the goal is not to unsort a list, whatever that means, but to not overload servers. And this answer is by far the easiest and cleanest way to do it. In comparison, it's ridiculous to use a large and complex optimization library, with poor performance for a large list. – Eric Duminil Dec 31 '21 at 16:29
  • @EricDuminil even if unsorting a list is not the most practical way to solve the OP's problem, I still think it's an interesting question in its own right. – Mark Ransom Jan 01 '22 at 06:47
  • I've upvoted but it's a bit more complicated than that. If you had say domain1 at 100 entries and then 50 domains at 2 entries each you'd almost immediately run exclusively on domain1. Ideally, with this setup, you'd want to "sandwich" domain1 urls with an entry from any of of the 50 domains, none of which are at risk of overload. – JL Peyret Jan 13 '22 at 22:35
  • @JLPeyret I agree this is more complicated than that but I'd argue that's also another subject which is already well known and that is already covered (maybe not in your preferred language but surely the principles are the same and reusable). [Quick search](https://stackoverflow.com/search?q=how+to+throttle+request). Getting familiar with notions like mutex, concurency and asynchony will definitely help better understanding your problem and designing a simpler and more appropriate solution. – gfache Jan 15 '22 at 23:33
18

I used Google OR-Tools to solve this problem. I framed it as a constraint optimization problem and modeled it that way.

from collections import defaultdict
from itertools import chain, combinations
from ortools.sat.python import cp_model

model = cp_model.CpModel()
data = [
   ('example.com', 'http://example.com/404'),
   ('test.com', 'http://test.com/404'),
   ('test.com', 'http://test.com/405'),
   ('example.com', 'http://example.com/405'),
   ('google.com', 'http://google.com/404'),
   ('example.com', 'http://example.com/406'),
   ('stackoverflow.com', 'http://stackoverflow.com/404'),
   ('test.com', 'http://test.com/406'),
   ('example.com', 'http://example.com/407')
]

tmp = defaultdict(list)
for (domain, url) in sorted(data):
    var = model.NewIntVar(0, len(data) - 1, url)
    tmp[domain].append(var)  # store URLs as model variables where the key is the domain

vals = list(chain.from_iterable(tmp.values()))  # create a single list of all variables
model.AddAllDifferent(vals)  # all variables must occupy a unique spot in the output

constraint = []
for urls in tmp.values():
    if len(urls) == 1:  # a single domain does not need a specific constraint
        constraint.append(urls[0])
        continue
    combos = combinations(urls, 2)
    for (x, y) in combos:  # create combinations between each URL of a specific domain
        constraint.append((x - y))

model.Maximize(sum(constraint))  # maximize the distance between similar URLs from our constraint list

solver = cp_model.CpSolver()
status = solver.Solve(model)
output = [None for _ in range(len(data))]

if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
    for val in vals:
        idx = solver.Value(val)
        output[idx] = val.Name()

print(output)

Output:

['http://example.com/407',
 'http://test.com/406',
 'http://example.com/406',
 'http://test.com/405',
 'http://example.com/405',
 'http://stackoverflow.com/404',
 'http://google.com/404',
 'http://test.com/404',
 'http://example.com/404']
gold_cy
  • I don't know the true size of the list that is going to be scrambled of course but wouldn't combinatory explosion be a problem here? I think the itertools combinations use is unfavourable in terms of big O. See this [post](https://stackoverflow.com/questions/53419536/what-is-the-computational-complexity-of-itertools-combinations-in-python) – Hommes Dec 29 '21 at 08:19
  • The combinations are restricted to a singular domain, not all URLs, but I understand your point. I could not think of a different way to add the constraint. I’ll take another look at the OR Tools API, perhaps there is something I can use to model that aspect differently. – gold_cy Dec 29 '21 at 12:00
  • You are maximising the sum of the differences, but shouldn't it be the sum of absolute values of the differences? – Alberto Santini Dec 30 '21 at 18:27
  • @AlbertoSantini it ends up being ultimately the same, the only difference might be the actual parameter of the URL but the domains will still be spaced apart appropriately – gold_cy Dec 30 '21 at 18:36
  • It's kinda interesting, but it still feels like using a nuke in order to kill flies. And possibly missing. – Eric Duminil Dec 31 '21 at 16:30
  • @gold_cy are you sure it ends up being the same? What if variable y takes a larger value than x? Then you are subtracting a distance, instead of adding. E.g., if for a domain you have three urls, with associated variables x, y, z, your objective function for those terms will be (x-y)+(x-z)+(y-z), won't it? But it should be |x-y|+|x-z|+|y-z|, these are not the same. – Alberto Santini Jan 01 '22 at 18:48
10

There is no obvious definition of unsortedness that would work best for you, but here's something that at least works well:

  1. Sort the list
  2. If the length of the list is not a power of two, then spread the items out evenly in a list with the next power of two size
  3. Find a new index for each item by reversing the bits in its old index.
  4. Remove the gaps to bring the list back to its original size.

In sorted order, the indexes of items that are close together usually differ only in the smallest bits. By reversing the bit order, you make the new indexes for items that are close together differ in the largest bits, so they will end up far apart.

def bitreverse(x, bits):
    # reverse the lower 32 bits
    x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)
    x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)
    x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)
    x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)
    x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16)
    # take only the appropriate length
    return (x>>(32-bits)) & ((1<<bits)-1)

def antisort(inlist): 
    if len(inlist) < 3:
        return inlist
    inlist = sorted(inlist)
    #get the next power of 2 list length
    p2len = 2
    bits = 1
    while p2len < len(inlist):
        p2len *= 2
        bits += 1
    templist = [None] * p2len
    for i in range(len(inlist)):
        newi = i * p2len // len(inlist)
        newi = bitreverse(newi, bits)
        templist[newi] = inlist[i]
    return [item for item in templist if item is not None]

print(antisort(["a","b","c","d","e","f","g",
    "h","i","j","k","l","m","n","o","p","q","r",
    "s","t","u","v","w","x","y","z"]))

Output:

['a', 'n', 'h', 'u', 'e', 'r', 'k', 'x', 'c', 'p', 'f', 's',
 'm', 'z', 'b', 'o', 'i', 'v', 'l', 'y', 'd', 'q', 'j', 'w', 'g', 't']
Matt Timmermans
  • `[0, 1, 1, 2, 2, 3, 3]` for example, should return zero in the middle, like `[1,2,3,0,1,2,3]`, but it's not – ddofborg Dec 28 '21 at 17:30
  • @ddofborg I'm sure there is some kind of antisort that should put zero in the middle, but my way puts it at the start. – Matt Timmermans Dec 28 '21 at 19:36
  • I think - and the OP will correct me if I'm wrong about this - that the goal is less to take a sorted sequence and scramble it as much as possible and more about taking a list of values with duplicates and rearranging them such that identical values are spaced out as far apart from each other as possible. – templatetypedef Dec 29 '21 at 07:01
  • @templatetypedef The way I imagine this being used is that a list of URLs will be 'unsorted', a scraper will then pull those URLs in the unsorted order, and it would be nice if we didn't get many simultaneous requests to the same domain. That will reduce load on target web servers. But also, when we do get requests to the same domain, it would be nice if they hit different top-level URLs, for the services behind the web server, etc. – Matt Timmermans Dec 29 '21 at 13:14
  • @MattTimmermans The point is that 0 in the middle is a better antisort; the similar elements (1s 2s and 3s) are further apart. – Yakk - Adam Nevraumont Jan 04 '22 at 15:56
  • @Yakk-AdamNevraumont How is that generally better than putting n-1 in the middle? – Matt Timmermans Jan 04 '22 at 17:56
  • @MattTimmermans In the case given, a `0` in the middle increases the distance between the other duplicates, compared to `0` at the start. So the `0` in the middle is a better antisort, in that the goal is for duplicates to be as far apart as possible. Your system doesn't know about duplicates, it instead spreads indexes. So it misses that. – Yakk - Adam Nevraumont Jan 04 '22 at 21:47
  • That's right, it doesn't. There are duplicate domains, but there are no duplicate URLs in the real problem. – Matt Timmermans Jan 04 '22 at 22:14
4

You could implement an inverted binary search.

from typing import Union, List

sorted_int_list = [1, 1, 2, 3, 3]
unsorted_int_list = [1, 3, 2, 1, 3]

sorted_str_list = [
    "example.com",
    "example.com",
    "test.com",
    "stackoverflow.com",
    "stackoverflow.com",
]
unsorted_str_list = [
    "example.com",
    "stackoverflow.com",
    "test.com",
    "example.com",
    "stackoverflow.com",
]


def inverted_binary_search(
    input_list: List[Union[str, int]],
    search_elem: Union[int, str],
    list_selector_start: int,
    list_selector_end: int,
) -> int:
    if list_selector_end - list_selector_start <= 1:
        if search_elem < input_list[list_selector_start]:
            return list_selector_start - 1
        else:
            return list_selector_start

    list_selector_mid = (list_selector_start + list_selector_end) // 2
    if input_list[list_selector_mid] > search_elem:
        return inverted_binary_search(
            input_list=input_list,
            search_elem=search_elem,
            list_selector_start=list_selector_mid,
            list_selector_end=list_selector_end,
        )
    elif input_list[list_selector_mid] < search_elem:
        return inverted_binary_search(
            input_list=input_list,
            search_elem=search_elem,
            list_selector_start=list_selector_start,
            list_selector_end=list_selector_mid,
        )
    else:
        return list_selector_mid


def inverted_binary_insertion_sort(your_list: List[Union[str, int]]):
    for idx in range(1, len(your_list)):
        selected_elem = your_list[idx]
        inverted_binary_search_position = (
            inverted_binary_search(
                input_list=your_list,
                search_elem=selected_elem,
                list_selector_start=0,
                list_selector_end=idx,
            )
            + 1
        )

        for idk in range(idx, inverted_binary_search_position, -1):
            your_list[idk] = your_list[idk - 1]

        your_list[inverted_binary_search_position] = selected_elem
    return your_list

Output

inverted_sorted_int_list = inverted_binary_insertion_sort(sorted_int_list)
print(inverted_sorted_int_list)
>> [1, 3, 3, 2, 1]

inverted_sorted_str_list = inverted_binary_insertion_sort(sorted_str_list)
print(inverted_sorted_str_list)
>> ['example.com', 'stackoverflow.com', 'stackoverflow.com', 'test.com', 'example.com']

Update:

Given the comments, you could also run the function twice. This will untangle duplicates.

inverted_sorted_int_list = inverted_binary_insertion_sort(
    inverted_binary_insertion_sort(sorted_int_list)
)
>> [1, 3, 2, 1, 3]
Hommes
2

Here's a stab at it, but I am not sure it wouldn't degenerate a bit given particular input sets.

We pick the most frequently found item and append its first occurrence to a list. Then we do the same with the 2nd most frequent, and so on.

Repeat this for half the count of the most frequent item. That's the left half of the list.

Then, moving from least frequent to most frequent, pick the first remaining item of each key and add it. When an item occurs fewer times than half the maximum, pick which side you want to put it on.

Essentially, we layer key by key and end up with more frequent items at left-most and right-most positions in the unsorted list, leaving less frequent ones in the middle.

import random
from collections import defaultdict


def unsort(lst: list) -> list:
    """
    build a dictionary by frequency first
    then loop thru the keys and append 
    key by key with the other keys in between
    """

    result = []
    
    #dictionary by keys (this would be domains to urls)
    di = defaultdict(list)
    for v in lst:
        di[v].append(v)

    #sort by decreasing dupes length
    li_len = [(len(val),key) for key, val in di.items()]
    li_len.sort(reverse=True)

    #most found count
    max_c = li_len[0][0]
    #halfway point
    odd = max_c % 2
    num = max_c // 2
    if odd:
        num += 1

    #frequency, high to low
    by_freq = [tu[1] for tu in li_len]


    di_side = {}

    #for the first half, pick from frequent to infrequent
    #alternating by keys
    for c in range(num):

        #frequent to less
        for key in by_freq:

            entries = di[key]

            #first pass:  less than half the number of values 
            #we don't want to put all the infrequents first
            #and have a more packed second side
            if not c:
                #pick on way in or out?
                if len(entries) <= num:
                    #might be better to alternate L,R,L
                    di_side[key] = random.choice(["L","R"])
                else:
                    #pick on both
                    di_side[key] = "B"
                

            #put in the left half
            do_it = di_side[key] in ("L","B")
            
            if do_it and entries:
                result.append(entries.pop(0))

    #once you have mid point pick from infrequent to frequent
    for c in range(num):
        #frequent to less
        for key in reversed(by_freq):
            entries = di[key]

            #put in the right half 
            do_it = di_side[key] in ("R","B")
            
            if entries:
                result.append(entries.pop(0)) 

    return result

Running this I got:

([1, 1, 2, 3, 3], '2.00', '====>', [3, 1, 2, 1, 3], '3.41')
([1, 3, 2, 3, 1], '3.41', '====>', [3, 1, 2, 1, 3], '3.41')
([1, 2, 3, 0, 1, 2, 3], '6.00', '====>', [3, 2, 1, 0, 1, 2, 3], '5.86')
([3, 2, 1, 0, 1, 2, 3], '5.86', '====>', [3, 2, 1, 0, 1, 2, 3], '5.86')
([3, 2, 1, 3, 1, 2, 3], '6.88', '====>', [3, 2, 3, 2, 1, 3, 1], '5.97')
([1, 2, 3, 4, 5], '0.00', '====>', [5, 1, 2, 3, 4], '0.00')

Oh, and I also added an assert to check nothing had been dropped or altered by the unsorting:

assert(sorted(lst) == sorted(ulst))

alternate approach?

I'll leave it as a footnote for now, but the general idea of not clustering (not the OP's specific application of not overloading domains) seems like it would be a candidate for a force-repulsive approach, where identical domains would try to keep as far from each other as possible, i.e. 1, 1, 2 => 1, 2, 1, because the 1s would repel each other. That's a wholly different algorithmic approach, however; a rough sketch follows below.
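
A rough sketch of what that could look like: same-key items repel each other along a line and the final order is read off from the relaxed positions. The step count and force constants are arbitrary, and singleton keys (which feel no force) simply keep their initial slot:

from collections import defaultdict

def repulsive_unsort(lst: list, steps: int = 200, strength: float = 0.01) -> list:
    n = len(lst)
    if n < 2:
        return list(lst)
    pos = [i / (n - 1) for i in range(n)]  # initial positions on [0, 1]
    groups = defaultdict(list)
    for i, key in enumerate(lst):
        groups[key].append(i)
    for _ in range(steps):
        force = [0.0] * n
        for idxs in groups.values():
            for a in idxs:
                for b in idxs:
                    if a == b:
                        continue
                    d = pos[a] - pos[b]
                    # same-key items push each other apart, more weakly with distance
                    force[a] += strength * (1 if d >= 0 else -1) / (abs(d) + 1e-3)
        for i in range(n):
            pos[i] = min(1.0, max(0.0, pos[i] + force[i]))
    order = sorted(range(n), key=lambda i: pos[i])
    return [lst[i] for i in order]

# repulsive_unsort([1, 1, 2, 3, 3]) gives [1, 3, 2, 1, 3] here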

JL Peyret
1

When I faced a similar problem, here's how I solved it:

  1. Define the "distance" between two strings (URLs in this case) as their Levenshtein distance (code to compute this value is readily available)
  2. Adopt your favorite travelling-salesman algorithm to find the (approximate) shortest path through your set of strings (finding the exact shortest path isn't computationally feasible but the approximate algorithms are fairly efficient)
  3. Now modify your "distance" metric to be inverted -- i.e. compute the distance between two strings (s1,s2) as MAX_INT - LevenshteinDistance(s1,s2)
  4. With this modification, the "shortest path" through your set is now really the longest path, i.e. the most un-sorted one (see the sketch after this list).
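
A quick sketch of this recipe, using a plain greedy nearest-neighbour heuristic as the "travelling-salesman algorithm" and a small hand-rolled Levenshtein function; MAX_DIST is an arbitrary large constant standing in for MAX_INT:

def levenshtein(s1: str, s2: str) -> int:
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (c1 != c2)))  # substitution
        prev = cur
    return prev[-1]

MAX_DIST = 10 ** 6

def inverted_distance(s1: str, s2: str) -> int:
    # large when the strings are similar, small when they are very different
    return MAX_DIST - levenshtein(s1, s2)

def antisort_tsp(urls: list) -> list:
    # greedy "shortest path" under the inverted metric: each step moves to the
    # remaining URL that is most dissimilar to the current one
    remaining = list(urls)
    path = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda u: inverted_distance(path[-1], u))
        remaining.remove(nxt)
        path.append(nxt)
    return path

# antisort_tsp(['http://example.com/404', 'http://example.com/405',
#               'http://test.com/404', 'http://test.com/405'])
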
Jeremy Friesner
0

I hope this algorithm works correctly:

unsorted_list = ['c', 'a', 'a', 'a', 'a', 'b', 'b']

d = {i: unsorted_list.count(i) for i in unsorted_list}
print(d)  # {'c': 1, 'a': 4, 'b': 2}

d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
print(d)  # {'a': 4, 'b': 2, 'c': 1}

result = [None] * len(unsorted_list)
border_index_left = 0
border_index_right = len(unsorted_list) - 1

it = iter(d)


def set_recursively(k, nk):
    set_borders(k)
    set_borders(nk)
    if d[k]:
        set_recursively(k, nk)


def set_borders(key):
    global border_index_left, border_index_right
    if key is not None and d[key]:
        result[border_index_left] = key
        d[key] = d[key] - 1
        border_index_left = border_index_left + 1
    if key is not None and d[key]:
        result[border_index_right] = key
        d[key] = d[key] - 1
        border_index_right = border_index_right - 1


next_element = next(it, None)
for k, v in d.items():
    next_element = next(it, None)
    set_recursively(k, next_element)

print(result)  # ['a', 'b', 'a', 'c', 'a', 'b', 'a']

Visually, it looks like walking from the edges to the middle:

[2, 3, 3, 3, 1, 1, 0]

[None, None, None, None, None, None, None]
[3, None, None, None, None, None, None]
[3, None, None, None, None, None, 3]
[3, 1, None, None, None, None, 3]
[3, 1, None, None, None, 1, 3]
[3, 1, 3, None, None, 1, 3]
[3, 1, 3, 2, None, 1, 3]
[3, 1, 3, 2, 0, 1, 3]
IgorZ
  • `[1,2,3,0,1,2,3]` this one generates a `maximum recursion depth exceeded` error. – ddofborg Dec 28 '21 at 17:26
  • Thanks! The issue was with the `if key and d[key]:` condition. The `if key is not None and d[key]:` might be ok. – IgorZ Dec 28 '21 at 17:39
  • `[1, 1, 2, 2, 3, 3]` returns `[1, 2, 3, 3, 2, 1]`, but `[3, 2, 1, 3, 1, 2, 3]` is a little better. – ddofborg Dec 28 '21 at 18:29
  • @ddofborg - your better list has one too many 3s. `[1, 2, 3, 3, 2, 1]` is (tied for) optimal under the distance metric in the text of the question (but not the code. I made a comment about that). – TLW Dec 30 '21 at 03:22
  • @tlw True, sorry for that. `[1, 1, 2, 2, 3, 3]` would be better as `[1,2,3,1,2,3]` as 3s are not next to each other. – ddofborg Dec 30 '21 at 11:09
  • `[1,2,3,1,2,3]` and `[1,2,3,3,2,1]` are tied under the distance metric in the text of the question. Sum of distances in both are `3+3+3=1+3+5=9`. – TLW Jan 01 '22 at 19:31
0

An easy way to scramble a list is to maximize its "sortness" score using a genetic algorithm with a permutation chromosome. I was able to quickly hack together a version in R using the GA package. I'm not a Python user, but I am sure there are GA libraries for Python that include permutation chromosomes. If not, a general GA library with real-valued vector chromosomes can be adapted: just use a vector with values in [0, 1] as a chromosome and convert each vector to its sort index.
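
A minimal Python sketch of that random-key idea (a tiny hand-rolled GA rather than a library; population size, generation count and mutation rate are arbitrary, and the fitness is the plain sum-of-distances score from the question):

import random

def sortness(lst):
    last_seen = {}
    score = 0
    for i, key in enumerate(lst):
        if key in last_seen:
            score += i - last_seen[key]
        last_seen[key] = i
    return score

def decode(chrom, items):
    # the sort order of the random keys defines a permutation of the items
    order = sorted(range(len(items)), key=lambda i: chrom[i])
    return [items[i] for i in order]

def ga_unsort(items, pop_size=60, generations=200, mutation=0.1):
    n = len(items)
    if n < 2:
        return list(items)
    pop = [[random.random() for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda c: sortness(decode(c, items)), reverse=True)
        elite = ranked[:pop_size // 2]          # keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, n)
            child = a[:cut] + b[cut:]           # one-point crossover
            if random.random() < mutation:
                child[random.randrange(n)] = random.random()  # point mutation
            children.append(child)
        pop = elite + children
    best = max(pop, key=lambda c: sortness(decode(c, items)))
    return decode(best, items)

# ga_unsort([1, 1, 2, 3, 3]) will usually return something like [1, 3, 2, 1, 3]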

prubin
0

Just saying, putting a short time delay would work just fine. I think someone mentioned this already. It is very simple and very reliable. You could do something like:

from random import choice
from time import sleep

import requests

intervalList = [0.1, 0.2, 0.3, 0.4, 0.5]  # candidate delays, in seconds
error404 = []
connectionError = []
for url in your_URL_list:
    try:
        statusCode = requests.get(str(url)).status_code
        if statusCode == 404:
            error404.append(url)
    except requests.ConnectionError:
        connectionError.append(url)
    sleep(choice(intervalList))  # short random pause between requests

Cheers