I am looking for a function to make the list as unsorted as possible. Preferably in Python.
Backstory:
I want to check URLs statuses and see if URLs give a 404 or not. I just use asyncio
and requests
modules. Nothing fancy.
Now I don't want to overload servers, so I want to minimize checking URLs which are on the same domain at the same time. I have this idea to sort the URLs in a way that items which are close to one another (having the same sort key = domain name) are placed as far apart from each other in the list as possible.
An example with numbers:
a=[1,1,2,3,3] # <== sorted list, sortness score = 2
0,1,2,3,4 # <== positions
could be unsorted as:
b=[1,3,2,1,3] # <== unsorted list, sortness score = 6
0,1,2,3,4 # <== positions
I would say that we can compute a sortness score by summing up the distances between equal items (which have the same key = domain name). Higher sortness means better unsorted. Maybe there is a better way for testing unsortness.
The sortness score for list a
is 2. The sum of distances for 1 is (1-0)=1, for 2 is 0, for 3 is (4-3)=1.
The sortness score for list b
is 6. The sum of distances for 1 is (3-0)=3, for 2 is 0, for 3 is (4-1)=3.
URLs list would look something like a list of (domain, URL) tuples:
[
('example.com', 'http://example.com/404'),
('test.com', 'http://test.com/404'),
('test.com', 'http://test.com/405'),
('example.com', 'http://example.com/405'),
...
]
I am working on a prototype which works Ok-ish, but not optimal as I can find some variants which are better unsorted by hand.
Anyone wants to give it a go?
This is my code, but it's not great :):
from collections import Counter
from collections import defaultdict
import math
def test_unsortness(lst:list) -> float:
pos = defaultdict(list)
score = 0
# Store positions for each key
# input = [1,3,2,3,1] => {1: [0, 4], 3: [1, 3], 2: [2]}
for c,l in enumerate(lst):
pos[l].append(c)
for k,poslst in pos.items():
for i in range(len(poslst)-1):
score += math.sqrt(poslst[i+1] - poslst[i])
return score
def unsort(lst:list) -> list:
free_positions = list(range(0,len(lst)))
output_list = [None] * len(free_positions)
for val, count in Counter(lst).most_common():
pos = 0
step = len(free_positions) / count
for i in range(count):
output_list[free_positions[int(pos)]] = val
free_positions[int(pos)] = None # Remove position later
pos = pos + step
free_positions = [p for p in free_positions if p]
return output_list
lsts = list()
lsts.append( [1,1,2,3,3] )
lsts.append( [1,3,2,3,1] ) # This has the worst score after unsort()
lsts.append( [1,2,3,0,1,2,3] ) # This has the worst score after unsort()
lsts.append( [3,2,1,0,1,2,3] ) # This has the worst score after unsort()
lsts.append( [3,2,1,3,1,2,3] ) # This has the worst score after unsort()
lsts.append( [1,2,3,4,5] )
for lst in lsts:
ulst = unsort(lst)
print( ( lst, '%.2f'%test_unsortness(lst), '====>', ulst, '%.2f'%test_unsortness(ulst), ) )
# Original score Unsorted score
# ------- ----- -------- -----
# ([1, 1, 2, 3, 3], '2.00', '====>', [1, 3, 1, 3, 2], '2.83')
# ([1, 3, 2, 3, 1], '3.41', '====>', [1, 3, 1, 3, 2], '2.83')
# ([1, 2, 3, 0, 1, 2, 3], '6.00', '====>', [1, 2, 3, 1, 2, 3, 0], '5.20')
# ([3, 2, 1, 0, 1, 2, 3], '5.86', '====>', [3, 2, 1, 3, 2, 1, 0], '5.20')
# ([3, 2, 1, 3, 1, 2, 3], '6.88', '====>', [3, 2, 3, 1, 3, 2, 1], '6.56')
# ([1, 2, 3, 4, 5], '0.00', '====>', [1, 2, 3, 4, 5], '0.00')
PS. I am not looking just for a randomize function and I know there are crawlers which can manage domain loads, but this is for the sake of exercise.