
I have two lists of strings like the following:

test1 = ["abc", "abcdef", "abcedfhi"]

test2 = ["The", "silver", "proposes", "the", "blushing", "number", "burst", "explores", "the", "fast", "iron", "impossible"]

The second list is longer, so I want to downsample it to the length of the first list by randomly sampling.

import random

def downsample(data):
    # Sample every list down to the length of the shortest one.
    min_len = min(len(x) for x in data)
    return [random.sample(x, min_len) for x in data]

downsample([test1, test2])

However, I want to add a restriction: the words chosen from the second list must match the length distribution of the first list. So the first randomly chosen word must have the same length as the first word of the shorter list, and so on. Sampling must also be without replacement.

How can I randomly select n (the length of the shorter list) elements from test2 that match the character-length distribution of test1? Thanks, Jack

Jack Arnestad
  • I would turn your second list into a dictionary keyed by string length. That way you can randomly sample from the matching list based on the lengths of the strings in your first list, and it's an O(1) lookup. – user3483203 Jun 16 '18 at 03:09
  • @user3483203 I'm sorry I'm not sure I follow. I would really appreciate it if you could write it as an answer and I will surely accept it if it works. – Jack Arnestad Jun 16 '18 at 03:13
  • Can you clarify what you mean by not wanting replacement? Can you not have *any* words from the initial list in the result, or is it on an index-by-index basis? – user3483203 Jun 16 '18 at 03:43
  • @user3483203 By not wanting replacement, I meant that if `test1 = ["abc", "abcdef","abcdef", "abcedfhi"]`, then in the downsampled second list, for example, silver could not appear twice, as it only shows up once in `test2`. – Jack Arnestad Jun 16 '18 at 03:53
  • Thanks for the clarification, I'll update my answer – user3483203 Jun 16 '18 at 03:54

2 Answers


Setup

from collections import defaultdict
import random
dct = defaultdict(list)
l1 = ["abc", "abcdef", "abcedfhi"]
l2 = ["The", "silver", "proposes", "the", "blushing", "number", "burst", "explores", "the", "fast", "iron", "impossible"]

First, use collections.defaultdict to create a dictionary where the key is word length:

for word in l2:
    dct[len(word)].append(word)

# Result
defaultdict(<class 'list'>, {3: ['The', 'the', 'the'], 6: ['silver', 'number'], 8: ['proposes', 'blushing', 'explores'], 5: ['burst'], 4: ['fast', 'iron'], 10: ['impossible']})

Then you may use a simple list comprehension along with random.choice to select a random word that matches the length of each element in your first list. If a word length is not found in your dictionary, fill with -1:

final = [random.choice(dct.get(len(w), [-1])) for w in l1]

# Output (one possible result)
['The', 'silver', 'blushing']

Edit based on clarified requirements
Here is an approach that fulfills the requirements of not allowing duplicates if a duplicate does not exist in list 2:

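# Rebuild the dictionary so words are not appended a second time
dct = defaultdict(list)
for word in l2:
    dct[len(word)].append(word)

# Shuffle each group so that pop() returns a random word
for k in dct:
    random.shuffle(dct[k])

final = [dct[len(w)].pop() for w in l1]
# One possible result: ['The', 'silver', 'proposes']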

This approach will raise an IndexError if not enough words exist in the second list to fulfill the distribution.
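
If a clearer failure is preferred over the bare IndexError, the required and available counts can be compared before any word is drawn. The following is only a sketch of that idea: the helper name `sample_matching_lengths` and its `rng` parameter are hypothetical and not part of the original answer, and it uses the same shuffle-then-pop technique as above.

```python
from collections import Counter, defaultdict
import random


def sample_matching_lengths(reference, pool, rng=random):
    """Pick one word from `pool` per word in `reference`, matching lengths.

    Hypothetical helper: each entry of `pool` is used at most once.
    """
    by_length = defaultdict(list)
    for word in pool:
        by_length[len(word)].append(word)

    # Check up front that every length is available often enough.
    needed = Counter(len(w) for w in reference)
    for length, count in needed.items():
        available = len(by_length.get(length, []))
        if available < count:
            raise ValueError(
                f"need {count} word(s) of length {length}, found {available}"
            )

    # Shuffle each group once, then pop so a word cannot be reused.
    for words in by_length.values():
        rng.shuffle(words)
    return [by_length[len(w)].pop() for w in reference]


l1 = ["abc", "abcdef", "abcedfhi"]
l2 = ["The", "silver", "proposes", "the", "blushing", "number",
      "burst", "explores", "the", "fast", "iron", "impossible"]
print(sample_matching_lengths(l1, l2))
```

Passing a seeded `random.Random` instance as `rng` should make the selection reproducible, which can help when testing.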

user3483203

One way is to build a list of the lengths of the items in test1, then use it to build a list of sublists, each containing the words from test2 that have the corresponding length. Finally, randomly pop one word from each sublist (following a similar answer), so that an item is removed from its sublist once it has been selected for the sample.

from random import randrange

test1 = ["abc", "abcdef", "abcedfhi"]
test2 = ["The", "silver", "proposes", "the", "blushing", "number", "burst", "explores", "the", "fast", "iron", "impossible"]

sizes = [len(i) for i in test1]
# results: [3, 6, 8]

sublists = [[item for item in test2 if len(item) == i] for i in sizes]
# results for sublists: [['The', 'the', 'the'], ['silver', 'number'], ['proposes', 'blushing', 'explores']]

# randomly pop from the list for samples
samples = [i.pop(randrange(len(i))) for i in sublists]

print('Samples: ',samples)

Result (one possible output):

Samples:  ['the', 'number', 'blushing']
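
One thing to watch for: the comprehension above builds a separate sublist for every entry in `sizes`, so if `test1` contains two words of the same length, the same `test2` word could be drawn for both. A possible variation, assuming that is unwanted, keeps a single pool per distinct length. The name `pools` and the extra "abcdef" in `test1` are illustrative only.

```python
from random import randrange

# test1 here has a repeated length (two 6-letter words) to show the difference.
test1 = ["abc", "abcdef", "abcdef", "abcedfhi"]
test2 = ["The", "silver", "proposes", "the", "blushing", "number",
         "burst", "explores", "the", "fast", "iron", "impossible"]

sizes = [len(i) for i in test1]

# One shared pool per distinct length, so a popped word is gone for later draws.
pools = {size: [item for item in test2 if len(item) == size]
         for size in set(sizes)}

# randrange(0) raises ValueError if a pool has run out of words.
samples = [pools[size].pop(randrange(len(pools[size]))) for size in sizes]

print('Samples: ', samples)
```

With this input the two 6-letter slots always receive "silver" and "number" in some order, since those are the only 6-letter words in `test2`.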
niraj