7

I have a big file, around 60 GB.

I need to get n lines from the middle of the file. I am using a command with tail and head like

tail -m file | head -n > output.txt

where m and n are numbers.

The general structure of the file is as below, a set of records with comma-separated columns. Each line can have a different length (say, at most 5000 characters).

col1,col2,col3,col4...col10

Is there any other way I can get the n middle lines in less time? The current command takes a lot of time to execute.

Rubén
Mahesh
  • Can you tell us more about the data in your file, like its general structure? How are the lines separated? What is the maximum size of each line? Then we could try to seek directly to the required line. If your lines are not equal in length, we'll have to parse the file character by character; in that case, you are already using the best possible way. – Ashis Kumar Sahoo Dec 09 '13 at 07:33
  • Added the general structure of the record to the question. – Mahesh Dec 09 '13 at 08:46

6 Answers

14

With sed you can at least remove the pipeline:

sed -n '600000,700000p' file > output.txt

will print lines 600000 through 700000.

perreal
    If there are a lot of lines _after_ the last requested line, it might help to also add a 'q' command: `sed -n '600000,700000p;700000q' file`. Otherwise, sed will keep running until the last line of the file is read (even if nothing is printed). – geronimo Mar 20 '19 at 14:03
8

awk 'FNR>=n && FNR<=m'

followed by the name of the file, where n and m are the first and last line numbers you want to print.
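
For example (a concrete sketch, not part of the original answer; the range 600000–700000 is a placeholder), an extra exit rule stops awk from reading the rest of the file once the range has been printed:

awk 'FNR > 700000 { exit } FNR >= 600000' file > output.txt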

Anitha Mani
3

It might be more efficient to use the split utility, because with tail and head in a pipe you scan some parts of the file twice.

Example

split -l <k> <file> <prefix>

Where k is the number of lines you want to have in each file, and the (optional) prefix is added to each output file name.
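
For example (a hypothetical invocation, assuming chunks of 100000 lines):

split -l 100000 file part_

This writes part_aa, part_ab, ..., each containing 100000 lines (the last one possibly fewer), so the chunk holding the lines you need can then be read directly.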

Rajish
  • Yes, I thought of using this command, but my machine doesn't have enough space to store the split files :( – Mahesh Dec 09 '13 at 10:04
0

The only possible solution I can think of to speed up the search is to build an index of your lines, something like:

 0 00000000
 1 00000013
 2 00000045
   ...
 N 48579344

And then, knowing the fixed length of each index entry, you could jump quickly to the middle of your data file (or wherever you like...). Of course, you would have to keep the index updated whenever the file changes...
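
A rough sketch of how such an index could be built with awk (my illustration, not part of the original answer; it assumes single-byte characters and plain \n line endings, so character counts equal byte counts):

# record "line-number  byte-offset-of-line-start" for every line
awk 'BEGIN { off = 0 } { print NR, off; off += length($0) + 1 }' file > file.idx

# later, with OFFSET looked up from file.idx (tail -c counts from 1):
tail -c +$((OFFSET + 1)) file | head -n 100000 > output.txt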

Obviously, the canonical solution for such a problem would be to keep the data in a DB (see for example SQLite), and not in a plain file... :-)

MarcoS
  • My intention is to move this data to a DB. Because a few of the records are not properly structured, and due to some other issues, I am moving them to the DB in chunks. – Mahesh Dec 09 '13 at 09:10
0

Having the same problem (mine is an Asterisk Master.csv file), I am afraid there is no trivial solution: to access the 10,000,000-th line of a file (a plain file, not a database record or an in-memory representation of it), whatever you use has to count from 0 to 10,000,000... :-(

-2

Open the file in binary random-access mode, seek to the middle, and move forward sequentially until you reach a newline (\n, or \r\n for DOS-style line endings). Starting from the following character, dump N lines to your output file (one \n per line). Job done.

If your file is sorted and you need the data between two keys, use the method described above plus bisection.
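
With standard tools, that idea might look roughly like this (my sketch, not the answerer's; GNU dd syntax, with the skip amount and line count as placeholders for "about the middle of a 60 GB file"):

# jump roughly 30 GB into the file, drop the partial line we landed on,
# then dump the next 100000 complete lines
dd if=file bs=1M skip=30000 2>/dev/null | tail -n +2 | head -n 100000 > output.txt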

bobah