randomly sample text string based on matching prefix bash

Question

I have a list and I want to randomly select one text string for every unique prefix. For example, here's my list:

apples_1
apples_2
apples_3
banana_1
banana_2
pears_3

For each unique prefix (apples, banana, pears) I want to randomly select one. The desired output would then be:

apples_3
banana_1
pears_3

I've seen similar posts here and here on SO using arrays but it's unclear to me how to apply those answers here. I'm completely lost on how to go about doing this. Any suggestions to get me started would be greatly appreciated.

EDIT: per the user comment to show what I've tried:

Attempting to apply the SO arrays links above:

ARRAY=(filename.txt)
N1=$((RANDOM % 5))
SDFFILE=${ARRAY[$N1]}
echo $SDFFILE

Per the posts, I assumed the above would return 5 random lines of text and I would attempt to build out from there. Nothing happened and I couldn't follow those answers to troubleshoot.

Then I started thinking I would split my text strings and select one of the unique prefix lines.

cut -d'_' -f 1

The above returns all the prefixes (now with duplicates), but obviously subsampling from that list won't work, as it would only return:

apples
banana
pears

I'm having a hard time thinking through how to implement this. Thanks.

Gilles Quénot · Answer 1 · 2020-07-22T23:23:00.560

What I would do, if you know the prefixes in bash:

for fruit in apples banana pears; do
    # Anchor the match so only lines starting with "<prefix>_" are considered
    grep "^${fruit}_" Input_File | shuf -n 1
done

apples_1
banana_1
pears_3
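
If the prefixes are not known in advance, they could be derived from the file itself. A minimal sketch, assuming GNU shuf is available, that every line contains an underscore, and that the prefixes contain no regex metacharacters; the temporary file and the variable names are only for illustration:

```shell
#!/usr/bin/env bash
# Recreate the sample list from the question in a temporary file (illustrative only).
input=$(mktemp)
printf '%s\n' apples_1 apples_2 apples_3 banana_1 banana_2 pears_3 > "$input"

# List each unique prefix (the text before the first "_"),
# then pick one random line among those starting with "<prefix>_".
picked=$(
    cut -d_ -f1 "$input" | sort -u | while IFS= read -r prefix; do
        grep "^${prefix}_" "$input" | shuf -n 1
    done
)

printf '%s\n' "$picked"
rm -f "$input"
```

This reads the file once per prefix, so it is probably only reasonable for a modest number of prefixes.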

edited Jul 22 '20 at 23:23

answered Jul 22 '20 at 23:16


score 0 · Accepted Answer · answered Jul 22 '20 at 23:16

The most straightforward way is to use sort -R (GNU sort) in order to shuffle your file.

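First, sort the list randomly by prefix only. Lines that share a prefix stay grouped together, because -R orders lines by a random hash of the sort key: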

# sort -t_ -k1,1R filename.txt
apples_1
apples_2
apples_3
pears_3
banana_1
banana_2

To keep only the first line for each prefix, add the -u option:

# sort -t_ -k1,1R -u filename.txt
pears_3
banana_1
apples_1

The problem now is that the second field (after the "_" delimiter) is not shuffled: within each prefix, the lines keep their original order. Therefore the -u option will always output the same line for each prefix.

The solution is to shuffle the input file first:

# sort -R filename.txt | sort -t_ -k1,1R -u
pears_3
apples_3
banana_2
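
The same "shuffle first, then keep the first line per prefix" idea can also be written with shuf and awk instead of two sort calls. This is an alternative sketch, not part of the pipeline above; it assumes GNU shuf, and the temporary file and variable names are only for illustration:

```shell
#!/usr/bin/env bash
# Recreate the sample list from the question in a temporary file (illustrative only).
input=$(mktemp)
printf '%s\n' apples_1 apples_2 apples_3 banana_1 banana_2 pears_3 > "$input"

# shuf randomizes the line order; awk then prints a line only the first
# time its prefix (field 1, split on "_") is seen.
sample=$(shuf "$input" | awk -F_ '!seen[$1]++')

printf '%s\n' "$sample"
rm -f "$input"
```

The prefixes come out in random order here; pipe the result through sort if a stable order is wanted.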

PS: in your first attempt, ARRAY=(filename.txt) won't populate ARRAY with the content of filename.txt; it creates a one-element array holding the literal string "filename.txt". Use readarray (or its alias mapfile) for this:

# readarray -t ARRAY < filename.txt
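
Once the array is populated, picking one random element could look like the sketch below. It assumes bash 4 or later (for readarray) and reuses the variable names from the question; the temporary file stands in for filename.txt:

```shell
#!/usr/bin/env bash
# Recreate the sample list from the question in a temporary file (illustrative only).
input=$(mktemp)
printf '%s\n' apples_1 apples_2 apples_3 banana_1 banana_2 pears_3 > "$input"

# One array element per line; -t strips the trailing newline from each.
readarray -t ARRAY < "$input"

# Take the modulus against the actual array length rather than a hard-coded 5,
# so the index always falls between 0 and length-1.
N1=$((RANDOM % ${#ARRAY[@]}))
SDFFILE=${ARRAY[N1]}
echo "$SDFFILE"

rm -f "$input"
```

Note that this returns a single random line, not five; RANDOM % 5 in the original attempt only produced an index between 0 and 4.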

This works great for me as I have many unique prefixes! To clarify: the `-k1,1` is to sort only by the first field, otherwise lines are sorted by all fields (before and after the _), correct? — KNN, Jul 22 '20 at 23:53
Correct. And this limitation to the first field is also why the `-u` flag works. — xhienne, Jul 23 '20 at 01:53
