I mentioned in the comments that you could use sed for this. After trying it out, I gave up on sed, since its regex syntax doesn't support lookarounds. Perl's regexes do support them, so if you have the perl command, you can try this:
perl -pe 's/ ([a-z])(?= |$)/\1/g' file.txt
or
cat file.txt | perl -pe 's/ ([a-z])(?= |$)/\1/g'
What in the world does this fence post mean?
The -e option tells perl to take its script from the command line (that's the monstrous regex you see right after it), and -p runs that script on each line of the input and prints the result. (I'm no perl expert, so I'd welcome someone double-checking this; I only looked at perl -h for help.)
Now the regex.
The s/<match>/<replace>/g form follows sed's syntax: s substitutes the <match> with the <replace>, and the trailing g makes it apply globally, i.e. to every match on the line rather than only the first.
Here, the match was a literal space followed by ([a-z])(?= |$), which tells perl to match a space followed by a lower-case letter. [a-z] denotes the set of characters to match, and the parentheses around it form a capture group, used in the <replace> section.
The (?= |$) part then makes sure that what follows is either a space or the end of the line; that's the [positive] lookahead I was referring to before. The vertical bar means "or", so the lookahead checks for a space "or" the end of the line ($). The lookahead ensures the correct match without including the space/end in the match, which leaves that space free to start the next match.
The replace was \1, which replaces the match with the first capture group. (Perl also accepts $1 here, which is the more usual Perl spelling.) In this case, the capture group is whatever lower-case letter was matched, so the leading space is dropped.
Why this regex works
If you look at the first line of your text file:
Some word here: T h e Q u i c k B r o w n F o x J u m p s O v e r T h e L a z y D o g
We only want to match the lone lower-case letters, i.e. a-z with a space after them. If we only matched a-z, that would also include the letters of Some, word, and here. So we match lower-case letters with a space in front and a space (or the end of the line) behind. Only the leading space is part of the match, and since we replace the match with just the letter, that space is dropped.
Limitations of this regex
If your file had
Lol a word here: T h e Q u i c k B r o w n F o x J u m p s O v e r T h e L a z y D o g
then the output would include:
Lola word here: The Quick Brown Fox Jumps Over The Lazy Dog
This is not as accurate as gboffi's answer, which only matches after the colon, but still, regexes are a short hack ¯\_(ツ)_/¯.
Further Reading: Reference: What does this regex mean?