Script to remove a set number of lines in a text file?

Question

I'm new to the community, so bear with me. I have a text file with a bit over 2 million lines. The file has a header, 50 lines of actual data, and then 10 lines showing the same header, page number, date, and other info I don't need, all produced by the application I'm using to generate the file. This pattern then repeats over and over.

Is it possible to use a script to remove 10 lines at every 50 lines?

Can you give the information about the lines that you want to exclude ? Sample content will help. — vijayalakshmi d, Feb 05 '15 at 00:45
Agreed. Knowing what's unique about the lines you want to exclude, some sort of sequence that can be used to toggle wanted vs. unwanted would be extremely helpful. — rojo, Feb 05 '15 at 01:00
If you simply want to skip every first 10 lines out of 60 lines, following awk command will be useful. awk '{ if (NR % 60 > 10) { print $_ } } ' < file name> — vijayalakshmi d, Feb 05 '15 at 01:03
@vijayalakshmi d, the `awk` command you presented does not do what the OP is asking. — user3439894, Feb 05 '15 at 01:17
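
If the layout is read as 50 wanted lines followed by 10 unwanted lines, repeating to the end of the file, a modulo test on the line number could look like the sketch below. The function name `strip_blocks` and its arguments are made up for illustration, and the block sizes are an assumption that should be checked against the real file first.

```shell
#!/bin/sh
# strip_blocks KEEP DROP [FILE...]
# Hypothetical helper: print KEEP lines, skip DROP lines, and repeat
# until the end of the input. Writes to stdout; the input is not modified.
strip_blocks() {
    keep=$1
    drop=$2
    shift 2
    # (NR - 1) % (keep + drop) is the 0-based position of a line inside
    # its block; positions 0 .. keep-1 are the lines to preserve.
    awk -v keep="$keep" -v drop="$drop" \
        '(NR - 1) % (keep + drop) < keep' "$@"
}

# Small demonstration: keep 3 lines, drop 2, on the numbers 1..10
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | strip_blocks 3 2
# prints 1, 2, 3, 6, 7 and 8, one per line
```

For the file in the question this would be called as `strip_blocks 50 10 theFile.txt > cleaned.txt`, which leaves the original untouched and writes the result to a second file.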

score 1 · Answer 1 · edited May 23 '17 at 10:26

You can do that with a Batch file that uses a couple of third-party .exe programs. The trick consists of redirecting the file into both the Stdin and the Stdout of a subroutine, so that the subroutine can process the file by moving the file pointers of the standard handles in an appropriate way. You may review an example of a similar method at this post.

@echo off
setlocal EnableDelayedExpansion

if "%~1" equ ":ProcessFile" goto %1

set /A keep=50, delete=10

rem Invoke a subroutine to process the file via redirected Stdin and Stdout
rem use CMD /C so the loop inside it can be broken with EXIT /B

cmd /C call "%~F0" :ProcessFile < theFile.txt >> theFile.txt
goto :EOF


:ProcessFile

rem Initialize the process: preserve first N lines in Stdin
for /L %%i in (1,1,%keep%) do set /P "line="
rem ...and move Stdout file pointer to the same place
FilePointer 0 0 /C
FilePointer 1 %errorlevel%

rem Process the rest of lines in an endless loop
for /L %%_ in ( ) do (

   rem Read M lines without copying them (delete them)
   rem (advance just Stdin file pointer)
   for /L %%i in (1,1,%delete%) do set /P "line="

   rem ...and read and copy the next N lines
   rem (both Stdin and Stdout advance the same amount)
   for /L %%i in (1,1,%keep%) do set /P "line=!line!"

   rem Check for the EOF in Stdin after the last block copied
   set "line="
   set /P "line="
   if not defined line (
      rem EOF detected: truncate the Stdout file after the last written line
      TruncateFile 1
      rem ...and terminate
      exit /B
   )

)

An interesting aspect of this method is that the processing is done in place, in the same file; that is, the process does not require additional space to store an output file. Sections of data are moved from one place to another within the same file, and at the end the leftover space is truncated. Of course, this method destroys the original file, so you should make a copy of it before using this program.

This code probably has an off-by-one error in each section copied or deleted, but it is much simpler to run a test and adjust the values accordingly. I suggest you create a file with 4 or 5 sections and use it for tests. Also, the method used to detect the end of the file may require some adjustment. If you post the results produced by the test, I may help you fix these details.
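
One possible way to build such a test file is sketched below. It assumes a POSIX shell with awk is available (on Windows, for example, through Cygwin or WSL). The helper name `make_test_file` and the "keep"/"drop" labels are invented for this example. Tagging every line makes an off-by-one error easy to spot in the output.

```shell
#!/bin/sh
# make_test_file SECTIONS KEEP DROP
# Hypothetical helper: print SECTIONS blocks, each made of KEEP lines
# tagged "keep" followed by DROP lines tagged "drop".
make_test_file() {
    awk -v sections="$1" -v keep="$2" -v drop="$3" 'BEGIN {
        for (s = 1; s <= sections; s++) {
            for (i = 1; i <= keep; i++) printf "keep %d.%d\n", s, i
            for (i = 1; i <= drop; i++) printf "drop %d.%d\n", s, i
        }
    }'
}

# Five sections using the sizes set in the batch file (keep=50, delete=10)
make_test_file 5 50 10 > test.txt
wc -l < test.txt    # 300 lines in total
```

After running the batch file on a copy of `test.txt`, `grep -c "^drop"` on the result should report 0 and `grep -c "^keep"` should report 250 if the block sizes are right.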

You may read a further description of this stuff and download the FilePointer.exe and TruncateFile.exe auxiliary programs at this site.

score 0 · Answer 2 · answered Feb 05 '15 at 03:24

Here's an awk script that generates commands for ed; the commands delete the H lines of each header section and preserve the T lines between one header section and the next:

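awk -v sz="$(wc -l < file.txt)" -v H=10 -v T=40 'BEGIN {
  print "w"
  idx=1
  while(idx<=sz) {
    # clamp the range so the last command never points past the end of the file
    end=idx+H-1
    if (end>sz) end=sz
    print idx "," end "d"
    idx+=(H+T)
  }
}' | cat -n | sort -rn | cut -f2- | ed file.txt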

Here, H is the number of header lines to remove, and T is the number of lines to keep before the next header section. Set both to match the layout of your file.

The cat -n | sort -rn | cut -f2- pipeline is a trick to reverse the order of the lines produced by awk (the last line comes first, the second-to-last comes second, and so on).
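
The reversal can be tried on its own with a few sample commands. The wrapper name `reverse_lines` is made up for this example; the pipeline inside it is the one from the script.

```shell
#!/bin/sh
# reverse_lines: hypothetical wrapper around the pipeline from the answer.
# cat -n prefixes each line with its number and a tab, sort -rn orders the
# lines by that number in descending order, and cut -f2- removes the prefix.
reverse_lines() {
    cat -n | sort -rn | cut -f2-
}

# Three ed commands in the order awk prints them with H=10 and T=40
printf '%s\n' 'w' '1,10d' '51,60d' | reverse_lines
# prints:
# 51,60d
# 1,10d
# w
```

The order matters because every `d` command renumbers the lines that follow the deleted range. Deleting from the bottom of the file upward keeps the line numbers computed by awk valid, and the `w` printed first ends up last, so the file is written only after all deletions are done.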
