I need to strip blank lines from only the first 6 lines of a text file. I've attempted to cobble together a solution using this StackOverflow question and this file but to no avail.

Here's the sed script I'm using (aliased as faprep='~/misc-scripts/fa-prep.sed'); the last command is the one that's failing:

#!/opt/local/bin/sed -f

# Title Treatments
s|<\(/\?\)h1[^>]*\?>|[\1b]|g    # Replace <h1></h1> with [b][/b] for saga titles
s|<\(/\?\)h2[^>]*\?>|[\1i]|g    # Replace <h2></h2> with [i][/i] for arc titles
s|</\?h3[^>]*\?>||g             # Strip <h3 id=""></h3> out without removing chapter title text

# HTML tag strips & substitutions
s|</\?p>||g                 # Strip all <p></p> tags
s|<\(/\?\)em>|[\1i]|g       # Change <em></em> to [i][/i]
s|<\(/\?\)strong>|[\1b]|g   # Change <strong></strong> to [b][/b]

# Character code substitutions
s/&\#822[01];/\"/g  # Replace &#8220; and &#8221; with straight double quote (")
s/&\#8217;/\'/g     # Replace &#8217; with straight single quote (')
s/&\#8230;/.../g    # Replace &#8230; with a 3-period ellipsis (...)
s/&\#821[12];/--/g  # Replace &#8211; and &#8212; with a 2-hyphen dash (--)

# Final prep; stripping out unnecessary cruft
/<body>/,/<\/body>/!d   # Delete everything OUTSIDE the <body></body> tags
/<\/\?body>/d           # Then, delete the body tags :3

# Pay attention to meeeeeeee!!!!
1,6{/./!d}      # Remove blank lines from around titles??

Here's the command I'm running from terminal, which shows the last line failing to strip whitespace from the first 6 lines of the file (after all of the other modifications have been made, of course):

calyodelphi@dragonpad:~/pokemon-story/compilations $ ch='ch6'; faprep $ch-mmd.html > $ch-fa.txt; head -6 $ch-fa.txt

[b]Hoenn Saga (S1)[/b]

[i]Next City Arc (A2)[/i]

Chapter 6: A Peaceful City Stroll... Or Not
calyodelphi@dragonpad:~/pokemon-story/compilations $

The rest of the file is composed of a blank line after the third title and then paragraphs all separated by blank lines. I want to keep those blank lines, so that only the blank lines between the titles at the very top are stripped.

Just to clarify a few points: this file has Unix line endings, and the lines are supposed to not have spaces. Even viewing in a text editor that shows whitespace, each blank line contains only a newline character.

Calyo Delphi
  • Wild guess: does the input file have windows line endings? (are you perhaps working with cygwin?) Oh, and just to be sure: There are no spaces in those lines, are there? – Wintermute Mar 17 '15 at 19:02
  • Just edited my question to answer yours. ^..^ They're unix endings (Mac OS X) and the blank lines contain only a newline char. – Calyo Delphi Mar 17 '15 at 19:05
  • MacOS X's sed is a bit picky about a great many things. I think you'll need a semicolon there -- try `1,6{/./!d;}`, does that work? Although I'd have expected sed to complain about a syntax error in that case. – Wintermute Mar 17 '15 at 19:15
  • Hold on a moment; the `/<body>/,/<\/body>/!d` is making me suspicious. Do you want to ignore empty lines in the first six lines of the body tag? Because `1,6` applies to the first six lines of the whole input, not to the first six lines where the script gets that far. – Wintermute Mar 17 '15 at 19:25
  • Parsing HTML with regular expressions: http://stackoverflow.com/a/1732454/7552 – glenn jackman Mar 17 '15 at 19:29
  • @glennjackman Alas, that HTML is not parseable with regular expressions is a subset of the greater problem that HTML is not parseable at all. – Wintermute Mar 17 '15 at 19:36
  • @Wintermute I have GNU sed on OS X courtesy of macports, actually. That line just deletes everything outside the tags. Then the following line deletes the tags themselves, so that the actual titles and text of the chapter are what's left. – Calyo Delphi Mar 17 '15 at 19:38
  • @glennjackman Thaaaat link doesn't help. I'm not trying to parse HTML at all at this point. I'm just trying to remove some blank lines from within the first few lines of text. No HTML parsing at all. :\ – Calyo Delphi Mar 17 '15 at 19:39
  • @CalyoDelphi Presumably the `<body>` tag doesn't begin in the first line, so by the time you start processing the data in the body, at least part of the `1,6` range is already over. Do you mean to ignore empty lines in the first six lines of the body tag or in the first six lines of the whole input? Because the `1,6` approach is only applicable to the latter case, and knocking something together for the former is a little tricky, so I'm only going to do it if I know that it's what you want. – Wintermute Mar 17 '15 at 19:43
  • In that case, use 2 separate sed commands: `sed 'parse html and output the body' | sed '1,6{/^$/d;}'` – glenn jackman Mar 17 '15 at 19:47
  • @Wintermute As far as I can tell from the script's behavior, by the time it gets to that last line it's already removed the `<body>` tags and everything outside of them, so that the only stuff left is the text (titles & paragraphs) that was inside the `<body>` tags. So the last `1,6{/./!d}` line is effectively acting on that rather than the first 6 lines of the original input file. – Calyo Delphi Mar 17 '15 at 19:50
  • @CalyoDelphi You'd think so, but no. Using a second sed process is a solution, but I can whip up something to keep it in one. Hold on a moment. – Wintermute Mar 17 '15 at 19:51

2 Answers

Since the discussion in the comments made it clear that you want to ignore empty lines in the first six lines of the body tag -- in other words, the first six times that part of the script is reached -- rather than the first six lines of the overall input data, you cannot use the global line counters. Since you're not using the hold buffer, we can use it to build our own counter, though.

So, replace

1,6 { /./! d }

with

x               # swap in hold buffer
/.\{6\}/! {     # if the counter in it hasn't reached 6
  s/^/./        # increment by one (i.e., append a character)
  x             # swap the input back in
  /./!d         # if it is empty, discard it
  x             # otherwise swap back
}
x               # and swap back one more time. This dance ensures that the
                # line from the input is in the pattern space when we drop
                # out at the bottom to the printing, regardless of which
                # branches were entered.

Or, if this seems too complicated, use @glennjackman's suggestion and pipe the output of the first sed script through sed '1,6 { /./! d; }', since the second process will have its own line counters working on the preprocessed data. There's no fun in it, but it'll work.
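
To see the difference on something small, here is a minimal sketch. The sample input and the variable names (`sample`, `broken`, `two_pass`, `one_pass`) are made up for illustration, and the script is trimmed down to the body-extraction commands. I've written `/<\/*body>/d` in place of `/<\/\?body>/d` only so the sketch doesn't depend on GNU's `\?`.

```shell
# Hypothetical input: <body> sits on line 6, so input lines 1-6 are all
# gone before a trailing 1,6 block could ever see a blank line.
sample='<!DOCTYPE html>
<html>
<head>
<title>t</title>
</head>
<body>

Title One

Title Two

Title Three

First paragraph.
</body>
</html>'

# Single process, 1,6 counted against the ORIGINAL input: nothing is stripped.
broken=$(printf '%s\n' "$sample" |
  sed -e '/<body>/,/<\/body>/!d' -e '/<\/*body>/d' -e '1,6{/./!d;}')

# Two processes: the second sed numbers lines from the start of the body text.
two_pass=$(printf '%s\n' "$sample" |
  sed -e '/<body>/,/<\/body>/!d' -e '/<\/*body>/d' |
  sed '1,6{/./!d;}')

# Single process with the hold-buffer counter from above.
one_pass=$(printf '%s\n' "$sample" |
  sed -e '/<body>/,/<\/body>/!d' -e '/<\/*body>/d' \
      -e 'x' -e '/.\{6\}/!{' -e 's/^/./' -e 'x' -e '/./!d' -e 'x' -e '}' -e 'x')

# Print the three results, each followed by a --- separator line.
printf '%s\n---\n' "$broken" "$two_pass" "$one_pass"
```

On this input the first result should keep every blank line, while the other two should both collapse the titles and leave the blank line before the paragraph alone.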

Wintermute
  • Ah, one of your comments actually pointed me in the right direction towards a different solution! (Which I'll provide in an answer to my own question) But my roommate glanced over your solution and complimented that it's also a viable approach. :) – Calyo Delphi Mar 17 '15 at 20:07

This answer courtesy of @Wintermute's comments on my question pointing me in the right direction! I was mistakenly thinking that sed was working on the modified stream when I put that delete statement in at the very end. When I tried a different address (lines 9,14) it worked perfectly, but was too hackish for me to settle on. But this confirmed I needed to think of the stream as still including lines that I thought were already gone.
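
A tiny illustration of that point, as I understand it: line-number addresses refer to the original input, even for lines that only survive because earlier ones were deleted. The four-line input and the variable names `unchanged` and `shifted` are just made up for the demo.

```shell
# Lines 1-2 are deleted first; c and d are still input lines 3 and 4,
# so an address of 1,2 never matches them.
unchanged=$(printf 'a\nb\nc\nd\n' | sed -e '1,2d' -e '1,2s/^/X/')

# Addressing them by their original numbers does match.
shifted=$(printf 'a\nb\nc\nd\n' | sed -e '1,2d' -e '3,4s/^/X/')

printf '%s\n' "$unchanged"   # c and d, untouched
printf '%s\n' "$shifted"     # Xc and Xd
```

This is the same effect that made the 9,14 address work in my file: it happened to line up with where the titles sat in the original input.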

So I moved the delete statement above the statements that clear out the <body> tags and everything outside them, and used a regex address with the addr1,+N form (a GNU sed extension) to produce this final result:

The script:

#!/opt/local/bin/sed -f

# Title Treatments
s|<\(/\?\)h1[^>]*\?>|[\1b]|g    # Replace <h1></h1> with [b][/b] for saga titles
s|<\(/\?\)h2[^>]*\?>|[\1i]|g    # Replace <h2></h2> with [i][/i] for arc titles
s|</\?h3[^>]*\?>||g             # Strip <h3 id=""></h3> out without removing chapter title text

# HTML tag strips & substitutions
s|</\?p>||g                 # Strip all <p></p> tags
s|<\(/\?\)em>|[\1i]|g       # Change <em></em> to [i][/i]
s|<\(/\?\)strong>|[\1b]|g   # Change <strong></strong> to [b][/b]

# Character code substitutions
s/&\#822[01];/\"/g  # Replace &#8220; and &#8221; with straight double quote (")
s/&\#8217;/\'/g     # Replace &#8217; with straight single quote (')
s/&\#8230;/.../g    # Replace &#8230; with a 3-period ellipsis (...)
s/&\#821[12];/--/g  # Replace &#8211; and &#8212; with a 2-hyphen dash (--)

# Final prep; stripping out unnecessary cruft
/<body>/,+6{/^$/d}      # Remove blank lines from around titles
/<body>/,/<\/body>/!d   # Delete everything OUTSIDE the <body></body> tags
/<\/\?body>/d           # Then, delete the body tags :3

And the resulting output:

calyodelphi@dragonpad:~/pokemon-story/compilations $ ch='ch6'; faprep $ch-mmd.html > $ch-fa.txt; head -6 $ch-fa.txt
[b]Hoenn Saga (S1)[/b]
[i]Next City Arc (A2)[/i]
Chapter 6: A Peaceful City Stroll... Or Not

The next two weeks of training passed by too quickly and too slowly at the same time. [rest of paragraph omitted for space]

calyodelphi@dragonpad:~/pokemon-story/compilations $
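
For anyone who wants to try the addr1,+N part in isolation, here is a cut-down sketch. The sample input and the `sample`/`result` names are hypothetical, it assumes GNU sed (the `,+6` form is not in POSIX sed), and `/<\/*body>/d` stands in for the `\?` version from the script.

```shell
# Hypothetical input; requires GNU sed because of the addr1,+N address.
sample='<head>
</head>
<body>

Title One

Title Two

Title Three

First paragraph.
</body>'

# The range covers the <body> line plus the next 6 INPUT lines, so it has to
# run before the commands that delete the <body> line itself.
result=$(printf '%s\n' "$sample" |
  sed -e '/<body>/,+6{/^$/d;}' \
      -e '/<body>/,/<\/body>/!d' \
      -e '/<\/*body>/d')

printf '%s\n' "$result"
```

Note that the +6 window is counted in input lines after the `<body>` match, so it only fits when the titles and their blank lines occupy exactly that many lines.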

Thanks @Wintermute! :D

Calyo Delphi