7

I have a big file, around 60 GB.

I need to get n lines from the middle of the file. I am using a command with tail and head like

tail -m file | head -n > output.txt

where m and n are numbers.

The general structure of the file is as below, a set of records with comma-separated columns. Each line can have a different length (say, at most 5000 characters).

col1,col2,col3,col4...col10

Is there any other way I can get the n middle lines in less time? The current command takes a lot of time to execute.

Rubén
Mahesh
  • Can you tell us more about the data in your file, like its general structure? How are the lines separated? What is the maximum size of each line? Then we could try to seek directly to the required line. If your lines are not equal in length, we'll have to parse the file character by character; in that case, you are already using the best possible way. – Ashis Kumar Sahoo Dec 09 '13 at 07:33
  • Added the general structure of the record to the question. – Mahesh Dec 09 '13 at 08:46

6 Answers

14

With sed you can at least remove the pipeline:

sed -n '600000,700000p' file > output.txt

will print lines 600000 through 700000.

perreal
    If there are a lot of lines _after_ the last requested line, it might help to also add a 'q' command: `sed -n '600000,700000p;700000q' file`. Otherwise, sed will keep running until the last line of the file is read (even if nothing is printed). – geronimo Mar 20 '19 at 14:03
8

awk 'FNR>=n && FNR<=m'

followed by the name of the file, where n and m are the first and last line numbers you want to print.
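
For example (a concrete sketch, not part of the original answer; the range 600000–700000 is a placeholder), an extra exit rule stops awk from reading the rest of the file once the range has been printed:

awk 'FNR > 700000 { exit } FNR >= 600000' file > output.txt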

Anitha Mani
3

It might be more efficient to use the split utility, because with tail and head in a pipe you scan some parts of the file twice.

Example

split -l <k> <file> <prefix>

Where k is the number of lines you want to have in each file, and the (optional) prefix is added to each output file name.
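
For example (a hypothetical invocation, assuming chunks of 100000 lines):

split -l 100000 file part_

This writes part_aa, part_ab, ..., each containing 100000 lines (the last one possibly fewer), so the chunk holding the lines you need can then be read directly.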

Rajish
  • Yes, I thought of using this command, but my machine doesn't have enough space to store the split files :( – Mahesh Dec 09 '13 at 10:04
0

The only possible solution I can think of to speed up the search is to build an index of your lines, something like:

 0 00000000
 1 00000013
 2 00000045
   ...
 N 48579344

And then, knowing the fixed length of each index entry, you could jump quickly to the middle of your data file (or wherever you like...). Of course, you would have to keep the index updated whenever the file changes...
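
A rough sketch of how such an index could be built with awk (my illustration, not part of the original answer; it assumes single-byte characters and plain \n line endings, so character counts equal byte counts):

# record "line-number  byte-offset-of-line-start" for every line
awk 'BEGIN { off = 0 } { print NR, off; off += length($0) + 1 }' file > file.idx

# later, with OFFSET looked up from file.idx (tail -c counts from 1):
tail -c +$((OFFSET + 1)) file | head -n 100000 > output.txt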

Obviously, the canonical solution for such a problem would be to keep the data in a DB (see for example SQLite), and not in a plain file... :-)

MarcoS
  • My intention is to move this data to a DB. Because a few of the records are not properly structured, and due to some other issues, I am moving them to the DB in chunks. – Mahesh Dec 09 '13 at 09:10
0

Having the same problem (mine is an Asterisk Master.csv file), I am afraid there is no trivial solution: to access the 10,000,000-th line of a file (a plain file, not a database record or an in-memory representation of it), whatever you use has to count from 0 to 10,000,000... :-(

-2

Open the file in binary random-access mode, seek to the middle, and move forward sequentially until you reach a newline (\n, or \r\n for DOS-style line endings). Starting from the following character, dump N lines to your output file (one \n per line). Job done.

If your file is sorted and you need the data between two keys, use the method described above plus bisection.
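
With standard tools, that idea might look roughly like this (my sketch, not the answerer's; GNU dd syntax, with the skip amount and line count as placeholders for "about the middle of a 60 GB file"):

# jump roughly 30 GB into the file, drop the partial line we landed on,
# then dump the next 100000 complete lines
dd if=file bs=1M skip=30000 2>/dev/null | tail -n +2 | head -n 100000 > output.txt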

bobah