I have a list and I want to randomly select one text string for every unique prefix. For example, here's my list:

apples_1
apples_2
apples_3
banana_1
banana_2
pears_3

For each unique prefix (apples, banana, pears) I want to randomly select one. The desired output would then be:

apples_3
banana_1
pears_3

I've seen similar posts here and here on SO using arrays but it's unclear to me how to apply those answers here. I'm completely lost on how to go about doing this. Any suggestions to get me started would be greatly appreciated.

EDIT: per the user comment to show what I've tried:

  1. Attempting to apply the SO arrays links above:
ARRAY=(filename.txt)
N1=$((RANDOM % 5))
SDFFILE=${ARRAY[$N1]}
echo $SDFFILE

Per the posts, I assumed the above would return 5 random lines of text and I would attempt to build out from there. Nothing happened and I couldn't follow those answers to troubleshoot.

  2. Then I started thinking I would split my text strings and select one of the unique prefix lines.
cut -d'_' -f 1

This returns all the prefixes (now with duplicates), but subsampling from that list obviously won't give me what I want, as it would only return:

apples
banana
pears

I'm having a hard time thinking through how to implement this. Thanks.

KNN

2 Answers

What I would do, if you know the prefixes:

for fruit in apples banana pears; do
    # Anchor the match to the prefix, then pick one matching line at random
    grep "^${fruit}_" Input_File | shuf -n 1
done

apples_1
banana_1
pears_3
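
When the prefixes are not known in advance, the same loop can be driven by the file itself. The following is only a sketch under a few assumptions: every line has the form prefix_suffix, the prefixes contain no regex metacharacters, and GNU shuf is installed. The function name pick_one_per_prefix and the temporary sample file are hypothetical, used for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one random line per unique prefix found in file "$1".
# Assumes lines look like prefix_suffix and that GNU shuf is available.
pick_one_per_prefix() {
    local file=$1 prefix
    # Collect the unique prefixes (the text before the first "_").
    cut -d_ -f1 "$file" | sort -u | while IFS= read -r prefix; do
        # Anchor the match so that a prefix such as "pear" cannot match "pears_3".
        grep "^${prefix}_" "$file" | shuf -n 1
    done
}

# Sample data from the question, written to a temporary file for illustration.
input=$(mktemp)
printf '%s\n' apples_1 apples_2 apples_3 banana_1 banana_2 pears_3 > "$input"

pick_one_per_prefix "$input"
```

This reads the file once per prefix, which is fine for a handful of prefixes but may be slow when there are many.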
Gilles Quénot

The most straightforward way is to use sort -R (GNU sort) in order to shuffle your file.

First the list of prefixes sorted randomly:

# sort -t_ -k1,1R filename.txt
apples_1
apples_2
apples_3
pears_3
banana_1
banana_2

To keep only the first line for each prefix, add the -u option:

# sort -t_ -k1,1R -u filename.txt
pears_3
banana_1
apples_1

The problem now is that the second field, after the "_" delimiter, is kept as is, in its original order. Therefore the -u option will always output the same line for each prefix.

The solution is to shuffle the input file first:

# sort -R filename.txt | sort -t_ -k1,1R -u
pears_3
apples_3
banana_2
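
A possible variant of the same shuffle-then-deduplicate idea, assuming GNU shuf is available: shuffle the lines with shuf, then let awk keep the first line it sees for each prefix. This swaps the second sort for awk and is only a sketch; the function name random_per_prefix and the temporary sample file are hypothetical.

```shell
#!/usr/bin/env bash
# Sample data from the question, written to a temporary file for illustration.
input=$(mktemp)
printf '%s\n' apples_1 apples_2 apples_3 banana_1 banana_2 pears_3 > "$input"

# Hypothetical helper: shuffle all lines of file "$1", then keep only the first
# line encountered for each prefix (awk field 1, with "_" as the separator).
random_per_prefix() {
    shuf "$1" | awk -F_ '!seen[$1]++'
}

random_per_prefix "$input"
```

Unlike the sort -u version, the output order here follows the shuffle rather than a random ordering of the prefixes, but in both cases each prefix appears exactly once.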

PS: in your first attempt, ARRAY=(filename.txt) won't populate ARRAY with the content of filename.txt; it creates a one-element array holding the literal string filename.txt. Use readarray (or its alias mapfile) for this:

# readarray -t ARRAY < filename.txt
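
For illustration, here is how the first attempt might look once the array is filled with readarray. It is a sketch that assumes bash 4 or later (for readarray) and uses a temporary sample file in place of filename.txt. Note that it prints one random line, not five, and does not yet group by prefix; the array size replaces the hard-coded 5 so the index always stays in range.

```shell
#!/usr/bin/env bash
# Sample data from the question, written to a temporary file for illustration.
input=$(mktemp)
printf '%s\n' apples_1 apples_2 apples_3 banana_1 banana_2 pears_3 > "$input"

# Read each line of the file into one array element (-t strips the newlines).
readarray -t ARRAY < "$input"

# Pick a random index in the range 0 .. (number of elements - 1).
N1=$((RANDOM % ${#ARRAY[@]}))
SDFFILE=${ARRAY[$N1]}
echo "$SDFFILE"
```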
xhienne
  • This works great for me as I have many unique prefixes! To clarify: the `-k1,1` is to sort only by the first field, otherwise lines are sorted by all fields (before and after the _), correct? – KNN Jul 22 '20 at 23:53
  • Correct. And this limitation to the first field is also why the `-u` flag works. – xhienne Jul 23 '20 at 01:53