10

I'm trying to do something like this but for quoted emails (collapse each run of blank quoted lines into one), so this

On 2014-07-11 at 03:36 PM, <ilovespaces@email.com> wrote:                                                                                                                                                                                                                                                       
>Hi Everyone,                                                                                                                                                                                                                                                                                                                 
>                                                                                                                                                                                                                                                                                                                             
>                                                                                                                                                                                                                                                                                                                              
>                                                    
>I love spaces.
>                                                                                                                                                                                                                                                                                                                             
>                                                                                                                                                                                                                                                                                                                          
>                                                                                                                                                                                                                                                                                                                          
>That's all.                                                                                                                                                                                                                                                                                                                       

Would become this

On 2014-07-11 at 03:36 PM, <ilovespaces@email.com> wrote:                                                                                                                                                                                                                                                       
>Hi Everyone,                                                                                                                                                                                                                                                                                                                 
>                                                                                                                                                                                                                                                                                                                             
>I love spaces.
>                                                                                                                                                                                                                                                                     
>That's all.   

Thanks

user3843237

5 Answers

14

Assuming that each visual line is a proper logical line (string of characters ended with a \n), you can dispense with the rest of the tools and simply run uniq(1) on the input.

Example follows.

% cat tst
>Hi Everyone,
>
>
>
>I love spaces.
>
>
>
>That's all.

% uniq tst
>Hi Everyone,
>
>I love spaces.
>
>That's all.
%
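If the blank quoted lines carry different amounts of trailing whitespace, uniq will not treat them as identical. The following is a minimal sketch of a preprocessing step, assuming a POSIX sed; squeeze_quoted is a hypothetical helper name and not part of the original answer. Note that uniq will still collapse any other identical adjacent lines.

```shell
# Hypothetical helper: normalise whitespace-only quoted lines, then collapse repeats.
squeeze_quoted() {
    # Rewrite ">" followed only by spaces or tabs to a bare ">" so that
    # uniq sees the blank quoted lines as identical.
    sed 's/^>[[:blank:]]*$/>/' | uniq
}

# Usage: blank quoted lines with differing trailing whitespace.
printf '>Hi Everyone,\n>   \n> \n>\n>I love spaces.\n' | squeeze_quoted
```

Only lines that consist of > plus whitespace are rewritten, so trailing spaces on lines with real content are left alone.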
Noufal Ibrahim
  • 2
    Thanks. One of the reasons why I emphasise UNIX 101 in all the mentoring courses that I conduct. – Noufal Ibrahim Jul 16 '14 at 08:15
  • Although this is a good answer (probably the best one), if you copy the sample provided by the OP, each line has a different number of spaces on it, meaning all the lines would be printed –  Jul 16 '14 at 08:31
  • Yes. Which is why I posted the caveat about visual and logical lines. Maybe a preprocessing filter to clear out all trailing whitespaces would fix the problem but that would compromise the sheer simplicity of the answer. :) – Noufal Ibrahim Jul 16 '14 at 08:33
  • 1
    This depends on an assumption: there will never be two consecutive, exactly identical lines in a mail. – WKPlus Jul 16 '14 at 09:07
  • Well, in that case you can first squeeze the spaces: `tr -s ' ' < file | uniq`. – fedorqui Jul 16 '14 at 09:36
  • @fedorqui OK for trailing spaces, but it could also remove duplicates of lines containing similar text content (which should not be deleted). – Qeole Jul 17 '14 at 00:05
2

Try this:

sed -r '/^>\s*$/{N;/^>\s*\n>\s*$/D}'

Here is the explanation:

Commands used:

  1. N Append the next line of input into the pattern space.
  2. D Delete up to the first embedded newline in the pattern space. Start next cycle, but skip reading from the input if there is still data in the pattern space.

Patterns used:

  1. /^>\s*$/ matches a line consisting of '>' followed by zero or more whitespace characters
  2. /^>\s*\n>\s*$/ matches two consecutive such lines; it can only match after N has joined them in the pattern space

So the workflow of the sed command above is:

  1. read a line into the pattern space (if the end of the file is reached, exit)
  2. if the pattern space contains only '>' and optional whitespace, go to step 4; otherwise go to step 3
  3. print the content of the pattern space and go to step 1
  4. append '\n' and the next line to the pattern space; if the pattern space now contains only '>\n>' (ignoring whitespace), which means two consecutive '>' lines were found, go to step 5; otherwise go to step 3
  5. delete the content up to and including '\n', then go to step 2
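The steps above can be sketched as a commented, multi-line sed script. This is an illustrative variant rather than the exact command from the answer: it assumes a POSIX-style sed, uses [[:blank:]] in place of the GNU-specific \s, and uses $!N so that the last line is not lost on implementations where N quits at end of input. squeeze_sed is a hypothetical wrapper name.

```shell
# Hypothetical wrapper around the N/D loop described above.
squeeze_sed() {
    sed '
        # Only act on a quoted line that holds nothing but whitespace.
        /^>[[:blank:]]*$/ {
            # Join the next line unless this is already the last one.
            $!N
            # Two blank quoted lines in a row: drop the first, restart the cycle.
            /^>[[:blank:]]*\n>[[:blank:]]*$/ D
        }
    '
}

# Usage: a run of three blank quoted lines is reduced to one.
printf '>Hi Everyone,\n>   \n> \n>\n>I love spaces.\n>That is all.\n' | squeeze_sed
```

Because D removes the earlier line of each pair, the line that survives from a run is the last one, with whatever trailing whitespace it had.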
WKPlus
2
sed '/^>\s\s*$/d;$b;/^[^>]/b;a>'  input

Means:

/^>\s\s*$/d: Delete every line that consists of a single > followed by at least one whitespace character.

$b;/^[^>]/b: Print the last line, and any line not starting with >, unchanged and skip the rest of the script.

a>: Append a > line after every other line.

Gives:

On 2014-07-11 at 03:36 PM, <ilovespaces@email.com> wrote:
>Hi Everyone,
>
>I love spaces.
>
>That's all.     
perreal
1

Another awk-based solution:

awk '{ /^>\s*$/?b++:b=0; if (b<=1) print }' file

Breakdown:

/^>\s*$/?b++:b=0
    - ? :       the ternary operator
    - /^>\s*$/  matches a blank quoted line (">" followed only by whitespace)
    - b         variable that counts consecutive blank lines (b++).
                however, if the current line is non-blank, b is reset to 0.


if (b<=1) print
    print if the current line is non-blank (b==0)
          or if there is only one blank line (b==1).
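The same counting idea can be written with an explicit if/else, which may be easier to read than the ternary. This is a sketch under assumptions: it uses [ \t] instead of \s (which is a GNU awk extension), and squeeze_awk and blank are names chosen here for illustration.

```shell
# Hypothetical wrapper: print a blank quoted line only if it is the first of its run.
squeeze_awk() {
    awk '{
        # Count consecutive quoted lines that hold nothing but spaces or tabs.
        if ($0 ~ /^>[ \t]*$/) blank++; else blank = 0
        # Non-blank lines have blank == 0; the first blank of a run has blank == 1.
        if (blank <= 1) print
    }'
}

# Usage: the second blank quoted line in the run is dropped.
printf '>Hi Everyone,\n>  \n>\n>I love spaces.\n' | squeeze_awk
```

Unlike the sed loop, this keeps the first line of each run, including its trailing whitespace.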
Shaoyun
0

awk way

This actually takes the spaces into account, unlike the other answers (except perreal's :)). It also doesn't just insert a > after every line that has more than a > on it, so if there are several consecutive lines of text, no blank lines are inserted between them.

awk 'a=/^>[ ]*$/{x=$1}!a&&x{print x;x=0}!a' file

Explanation

a=/^>[ ]*$/                    Sets a to the result of the match (1 or 0). The pattern is
                               a line that begins with > and then has only spaces till the end

{x=$1}                        If the line matched, sets x to $1 (the >).

!a&&x                         If the line does not match the pattern and x is set

{print x;x=0}                 Print x (>) and reset x to zero

!a                            If the line does not match the pattern, print it

The way this works is that it sets x to > when it finds a line containing only > and spaces.
It then carries on until it finds a line that doesn't match, prints > and then prints that line. x is set again every time the pattern is found.
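The one-liner above can be spelled out with a named flag. This is a sketch intended to behave the same way, not the author's own code; squeeze_deferred and pending are hypothetical names. As with the one-liner, a run of blank quoted lines at the very end of the input is never printed, because no later line triggers the deferred >.

```shell
# Hypothetical wrapper: defer printing a single bare ">" until real content follows.
squeeze_deferred() {
    awk '
        # Blank quoted line: remember that one was seen, print nothing yet.
        /^>[ ]*$/ { pending = 1; next }
        # First other line after a run: emit one bare ">" and clear the flag.
        pending   { print ">"; pending = 0 }
        # Every line that reaches this rule is printed unchanged.
                  { print }
    '
}

# Usage: the run of blank quoted lines becomes a single bare ">".
printf '>Hi Everyone,\n>  \n> \n>I love spaces.\n' | squeeze_deferred
```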

Hope this helps :)