How to (optimally) pick a single normalized random word from a file with bash / sed / shuf?

Question

I'm looking to remove any non-alphabetic (English) characters and make the output lower-case from /usr/share/dict/words. Here's what I have so far:

sed "$(shuf -i "1-$(cat /usr/share/dict/words | wc -l)" -n 1)q;d" /usr/share/dict/words | tr '[:upper:]' '[:lower:]' | sed 's/[^-a-z]//g'

This works fine but is it possible to do it all in the one sed command?

EDIT: The American word file looks like this:

A
A's
AMD
AMD's
AOL
AOL's
AWS
AWS's
Aachen
Aachen's

I'm looking to make this lower-case and remove any non-alphabetic characters (as mentioned in my original question). The solution I have works fine but I'm hoping to reduce the number of commands (maybe just sed?). Output of the above would then be:

a
as
amd
amds
aol
aols
aws
awss
aachen
aachens

`sed` can do `tr` but it can't easily be made to implement `shuf` or `wc`, so, no, unlikely you can do it all in the one `sed` command — jhnc, May 12 '21 at 17:33
`I'm hoping to reduce the number of commands` What for? Write a function, and it will be one command then. `do it all in the one sed command?` sed is Turing complete, but any realistic sed script that solves this would be hundreds of pages long, mostly because sed lacks arithmetic. – KamilCuk, May 12 '21 at 17:34
I know sed can't do `shuf` so I should have been more specific. I am piping sed output into sed and tr so I know there's some optimization that could be done with that but I'm not sed-savvy enough (yet) to know that. I suppose I'll just figure it out myself and post it when I do — Neil C. Obremski, May 12 '21 at 17:39
`tr` should be way faster than `sed`, so there's nothing to optimize. https://stackoverflow.com/questions/4569825/sed-one-liner-to-convert-all-uppercase-to-lowercase Does this answer your question? – KamilCuk, May 12 '21 at 17:40

score 2 · Accepted Answer · answered May 12 '21 at 17:52

You don't need sed or wc: shuf can pick a random line from a file directly.
tr can delete the non-alphabetic characters, so sed isn't needed for that either.

shuf -n1 /usr/share/dict/words | tr -dc '[:alpha:]' | tr '[:upper:]' '[:lower:]'

glenn jackman

This is fantastic and it works on all languages. Well done and thank you! – Neil C. Obremski May 12 '21 at 19:55
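
One detail of the accepted pipeline worth noting: `tr -dc '[:alpha:]'` also deletes the trailing newline, so the word is printed without a line ending. A minimal sketch that keeps the newline and wraps the pipeline in a function, as suggested in the comments on the question (`random_word` is a hypothetical name, and GNU `shuf` and `tr` are assumed):

```shell
# random_word is a hypothetical helper name, not part of the original answer.
# Assumes GNU coreutils (shuf, tr). Pass a word file as $1, or omit it to
# fall back to the system dictionary.
random_word() {
    shuf -n1 "${1:-/usr/share/dict/words}" |   # pick one random line
        tr -dc '[:alpha:]\n' |                 # keep letters and the newline only
        tr '[:upper:]' '[:lower:]'             # lowercase what is left
}
```

Calling `random_word` with no argument reads /usr/share/dict/words; calling it with a path reads that file instead. Like the accepted answer, this drops hyphens, unlike the `sed 's/[^-a-z]//g'` in the question.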

score 1 · Answer 2 · answered May 12 '21 at 19:02

This single awk command should do the job:

awk '{gsub(/[^[:alpha:]]+/, ""); print tolower($0)}' file

a
as
amd
amds
aol
aols
aws
awss
aachen
aachens

anubhava
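
The command above normalizes every line of the file. To also pick the single random word the question asks for inside the same awk process, one option is reservoir sampling: replace the kept line with the current one with probability 1/NR. This is a sketch rather than part of the original answer; `awk_random_word` is a hypothetical name, and in awk implementations where `srand()` seeds from the clock in seconds, two calls within the same second will return the same word.

```shell
# awk_random_word FILE: print one random line of FILE, normalized.
# Hypothetical helper name; the technique is reservoir sampling of size 1.
awk_random_word() {
    awk '
        BEGIN { srand() }                  # seed the generator
        rand() * NR < 1 { word = $0 }      # keep line NR with probability 1/NR
        END {
            gsub(/[^[:alpha:]]+/, "", word)  # strip non-alphabetic characters
            print tolower(word)              # lowercase the survivor
        }
    ' "$1"
}
```

The first line is always kept (rand() is below 1), and each later line replaces it with a shrinking probability, so every line ends up equally likely without counting the lines first.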

score 1 · Answer 3 · answered May 13 '21 at 08:37

This might work for you (GNU sed and shuf):

shuf -n1 /usr/share/dict/words | sed 's/[^[:alpha:]-]//g;s/.*/\L&/'

Choose a random line, remove any non-alpha (except hyphen) characters and lowercase the result.

potong

This is pretty good. Something I tried (and failed) to figure out is how to do all of that `sed` stuff to a specific line from a larger file. I got around the issue by piping it in like you're doing here but I wondered, if I know line 17 is what I want then why can't I write `17q;s/...blahblah.../` ... it just didn't do what I expected – Neil C. Obremski May 13 '21 at 16:14
@NeilC.Obremski it is the other way round but also using the `-n` option and `p` flag on the substitution command e.g. `sed -n '17{s/[^[:alpha:]-]//g;s/.*/\L&/p;q}' file` – potong May 13 '21 at 22:30
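
To spell out why `17q;s/.../` does not behave as hoped: without `-n`, sed prints every line it reads, and `17q` quits on line 17 before the substitution ever runs on it, so lines 1 to 16 come out modified and line 17 comes out unmodified. Restricting the commands to the address and printing explicitly avoids both problems. A sketch of that idea as a function (`normalized_line` is a hypothetical name; `\L` is a GNU sed extension):

```shell
# normalized_line FILE N: print line N of FILE, normalized.
# Hypothetical helper name; requires GNU sed for \L.
normalized_line() {
    # -n suppresses automatic printing; the {...} block runs only on line N:
    # strip non-letters (keeping hyphens), lowercase and print, then quit.
    sed -n "${2}{s/[^[:alpha:]-]//g;s/.*/\L&/p;q}" "$1"
}
```

A random line number can then be supplied with `shuf -i`, as in the question, for example `normalized_line /usr/share/dict/words "$(shuf -i "1-$(wc -l < /usr/share/dict/words)" -n 1)"`.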
