Delete letter spacing in a regular text file

Question

I have a text file with a lot of rows that contain letter-spaced text, e.g.

cat test.txt
Some word here: T h e Q u i c k B r o w n F o x J u m p s O v e r T h e L a z y D o g
Some doggerel: J a c k A n d J i l l W e n t U p T h e H i l l

I am looking for a regular expression that I can apply to this text file, with some command-line tool in Linux, to delete the spacing between characters. The desired result:

cat result.txt
Some word here: The Quick Brown Fox Jumps Over The Lazy Dog
Some doggerel: Jack And Jill Went Up The Hill

Thank you

Also, (might be a typo on your part but) why is there a space after `Some word here: ` but no space after `Some doggerel:` in result.txt? – TrebledJ, Dec 15 '18 at 10:57
How could sed or awk possibly discriminate between the space between `J` and `u` (in `Jumps`), which is to be removed, and the space between `x` and `J` (`Fox Jumps`), which is to be preserved? TL;DR with the tools you mentioned your task is impossible. – gboffi, Dec 15 '18 at 11:02
@gboffi I'm guessing it's the spaces before capital letters which should be preserved. But OP should be more explicit. Also, I think it's possible with `sed`, but again that would depend on OP's actual requirements. – TrebledJ, Dec 15 '18 at 11:04
@TrebuchetMS Ooops, I stand corrected. If this is what they want, it can be done... – gboffi, Dec 15 '18 at 11:11
In the same row, I have normal text (i.e. without spacing between characters) and letter-spaced text (with a space between characters). – Peter, Dec 15 '18 at 12:15
@TrebuchetMS: I would like to preserve the space before capital letters, if that's possible and not particularly difficult. – Peter, Dec 15 '18 at 12:19

score 3 · Answer 1 · answered Dec 15 '18 at 11:25

If what you want is what was divined by TrebuchetMS in this comment, it's not difficult using awk:

$ awk -F: '{gsub(/ /,"",$2); gsub(/[A-Z]/," &",$2) ; print $1":"$2}' file.txt

The one-line program ① splits a line on :, ② erases all the spaces after the :, ③ puts a space in front of each capital letter (including the first one) and ④ prints the concatenation of $1 (what precedes the :), a : and $2, that is, the modified second part.
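The four steps can also be spelled out in a longer form. The following is only a sketch, not the answer's own code: `despace` is a hypothetical helper name, the line is split at the first colon with `index()` instead of `-F:` (an assumption, so that any later colons stay in the second part), and `[[:upper:]]` is used in place of `[A-Z]`.

```shell
# despace is a hypothetical helper name used only for this illustration.
despace() {
  awk '{
    i = index($0, ":")                # position of the first colon (0 if none)
    if (i == 0) { print; next }       # no colon: leave the line untouched
    head = substr($0, 1, i)           # step 1: everything up to and including ":"
    tail = substr($0, i + 1)          #         and everything after it
    gsub(/ /, "", tail)               # step 2: erase all spaces after the colon
    gsub(/[[:upper:]]/, " &", tail)   # step 3: put a space before each capital
    print head tail                   # step 4: print the reassembled line
  }' "$@"
}

printf '%s\n' 'Some word here: T h e Q u i c k B r o w n F o x' | despace
# prints: Some word here: The Quick Brown Fox
```

Like the one-liner, this assumes that every word after the colon starts with a capital letter.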

gboffi


In some locales `A-Z` includes most lower case letters because they are ordered as `aAbBcC...zZ` or similar rather than `abc...zABC...Z`. Use the character class `[[:upper:]]` instead of `[A-Z]` to avoid that problem. – Ed Morton Dec 15 '18 at 15:26

TrebledJ · Answer 2 · 2018-12-16T12:05:00.603

I mentioned in the comments that you could use sed for this. After trying it out, I lost hope in sed since I couldn't get lookarounds to work in its regexes. Apparently, the perl command can parse regexes with lookarounds. If you have the perl command, you can try this:

perl -pe 's/ ([a-z])(?= |$)/\1/g' file.txt

or

cat file.txt | perl -pe 's/ ([a-z])(?= |$)/\1/g'

What in the world does this fence post mean?

The -e option tells perl to take the script from the command line (that's the monstrous regex you see right after it), and -p wraps that script in a loop that reads the input line by line, runs the script on each line, and prints the result. (I'm no perl expert; I only looked at perl -h for help.)

Now the regex.

The s/<match>/<replace>/g follows sed's syntax. It'll search globally for the <match> and substitute it with the <replace>.

Here, the match was ` ([a-z])(?= |$)`, which tells perl to match a space followed by a lower-case letter. In `([a-z])`, the `[a-z]` denotes the set of characters to match, and the `()` denotes a capture group, used in the <replace> section.

The `(?= |$)` makes sure that what follows is either a space or the end of the line; that's the [positive] lookahead I was referring to before. The vertical bar means "or", so the lookahead will search for a space (` `) "or" the end of the line (`$`). The lookahead ensures the correct match while not including the space/end in the match.

The replace was \1, which will replace the match with the first capture group. In this case, the capture group is whatever lower-case letter was matched.

Why this regex works

If you look at the first line of your text file:

Some word here: T h e Q u i c k B r o w n F o x J u m p s O v e r T h e L a z y D o g

We only want to remove the space in front of lower-case letters, i.e. a-z. If we matched every space followed by a-z, that would also hit the spaces in front of word and here. So we match a lower-case letter with a space before it and a space (or the end of the line) after it. The leading space is part of the match, but the replacement contains only the letter, so the space is dropped.

Limitations of this regex

If your file had

Lol a word here: T h e Q u i c k B r o w n F o x J u m p s O v e r T h e L a z y D o g

then the output would include:

Lola word here: The Quick Brown Fox Jumps Over The Lazy Dog

This is not as accurate as gboffi's answer, which only touches the text after the colon, but still, regexes are a short hack ¯\_(ツ)_/¯.

Further Reading: Reference: What does this regex mean?

score 2 · Answer 3 · answered Dec 15 '18 at 13:05

This might work for you (GNU sed):

 sed -r ':a;s/^(.*: .*) ([[:lower:]])/\1\2/;ta' file

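After the colon and space in the current line, this replaces each space followed by a lower-case character with just that lower-case character. Each pass of the loop (the :a label and the ta branch) fixes one occurrence, working its way back along the line, and the loop ends when the substitution fails because all cases have been catered for.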
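To watch the loop work its way back along the line, the same substitution can be applied one pass at a time from the shell. This is just a tracing sketch with made-up variable names; it uses `-E`, which GNU sed accepts as a synonym for `-r`.

```shell
line='Some doggerel: J a c k'

while :; do
  # One pass of the substitution, without the :a / ta loop.
  next=$(printf '%s\n' "$line" | sed -E 's/^(.*: .*) ([[:lower:]])/\1\2/')
  # Stop when the substitution no longer changes anything.
  if [ "$next" = "$line" ]; then break; fi
  line=$next
  printf '%s\n' "$line"
done
# prints:
# Some doggerel: J a ck
# Some doggerel: J ack
# Some doggerel: Jack
```

Because `.*` is greedy, each pass removes the last remaining space that is followed by a lower-case letter, which is why the fixes appear from right to left.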

score 2 · Answer 4 · answered Dec 15 '18 at 15:17

With GNU awk for gensub():

$ awk 'BEGIN{FS=OFS=":"} {$2=gensub(/ ([^[:upper:]])/,"\\1","g",$2)}1' file
Some word here: The Quick Brown Fox Jumps Over The Lazy Dog
Some doggerel: Jack And Jill Went Up The Hill

with any awk:

$ awk 'BEGIN{FS=OFS=":"} {gsub(/ /,"",$2); gsub(/[[:upper:]]/," &",$2)}1' file
Some word here: The Quick Brown Fox Jumps Over The Lazy Dog
Some doggerel: Jack And Jill Went Up The Hill

score 1 · Answer 5 · answered Dec 15 '18 at 13:14

Here is one more variant using Perl

$ cat peter.txt
Some word here: T h e Q u i c k B r o w n F o x J u m p s O v e r T h e L a z y D o g
Some doggerel: J a c k A n d J i l l W e n t U p T h e H i l l

$ perl -F":" -lane ' $F[1]=~s/ //g; $F[1]=~s/([A-Z])/ \1/g; print "$F[0]:$F[1]" ' peter.txt
Some word here: The Quick Brown Fox Jumps Over The Lazy Dog
Some doggerel: Jack And Jill Went Up The Hill

score 1 · Answer 6 · answered Dec 16 '18 at 06:57

This problem can be solved in many different ways. The easiest way I can think of is to just remove the blank before each lower-case letter. I tried this with sed since, as TrebuchetMS mentioned, sed doesn't have lookarounds in its regexes:

echo "T h e Q u i c k B r o w n F o x J u m p s O v e r T h e L a z y D o g" |  sed 's/[[:blank:]]\([[:lower:]]\)/\1/g'

Output: The Quick Brown Fox Jumps Over The Lazy Dog

Mohit Rathore


You missed the one thing that makes it difficult to do with sed which is that the substitutions must only happen in the substring after the first `:`. Try your suggestion with the **actual** sample input to see how it fails. – Ed Morton Dec 17 '18 at 15:33
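The failure described in this comment is easy to reproduce. The snippet below runs the answer's sed command on a shortened version of the first sample line (shortened only to keep the output readable; the variable names are made up).

```shell
sample='Some word here: T h e Q u i c k'

# The substitution is applied to the whole line, so the spaces before
# "word" and "here" are removed as well.
result=$(printf '%s\n' "$sample" | sed 's/[[:blank:]]\([[:lower:]]\)/\1/g')

printf '%s\n' "$result"
# prints: Somewordhere: The Quick
```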
