
I have a .tar.gz file. It contains one 20GB text file with 20.5 million lines. I cannot extract that file as a whole and save it to disk. I need to do one of the following:

  1. Specify a number of lines per output file (say, 1 million) and get 21 files. This would be the preferred option.
  2. Extract a part of that file by line numbers, say lines 1000001 to 2000000, to get a file with 1M lines. I would have to repeat this step 21 times with different parameters, which is very bad.

Is it possible at all?

This answer - bash: extract only part of tar.gz archive - describes a different problem.

lyrically wicked

4 Answers

1

You can use the --to-stdout (or -O) option in tar to send the output to stdout. Then use sed to specify which set of lines you want.

#!/bin/bash
l=1
inc=1000000
p=1
while test $l -lt 21000000; do
  e=$(($l + $inc - 1))
  # --to-stdout streams the extracted file; sed keeps one range and
  # quits at its end. Note the archive is decompressed on every pass.
  tar -xz --to-stdout -f myfile.tar.gz file-to-extract.txt |
      sed -n -e "$l,$e p; ${e}q" > part$p.txt
  l=$(($l + $inc))
  p=$(($p + 1))
done
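The range arithmetic is easy to sanity-check on a small stand-in file first (with `seq` replacing the tar output and a chunk size of 10, both chosen just for the demo): consecutive chunks should meet with no overlap and no gap.

```shell
#!/bin/bash
# Same loop as above, scaled down: 25 lines, chunks of 10.
seq 25 > data.txt        # stand-in for the extracted text file
l=1; inc=10; p=1
while test $l -lt 25; do
  e=$(($l + $inc - 1))
  sed -n "$l,$e p; ${e}q" data.txt > part$p.txt
  l=$(($l + $inc))
  p=$(($p + 1))
done
wc -l part1.txt part2.txt part3.txt   # expect 10, 10, 5
```

Concatenating the parts should reproduce the input exactly, which confirms the boundaries are right.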
Brad Lanam
  • I don't know bash, but this looks interesting, I will test this. So, does that mean that 20GB-sized file won't be saved on disk? – lyrically wicked Jan 07 '15 at 06:36
  • It will be saved on disk, but in pieces. You'll have to adjust the script to do some number of chunks at a time. As is, the script will try to create all 21 pieces. – Brad Lanam Jan 07 '15 at 06:48
  • Right. I made an assumption about not enough disk space, but rather you just have a problem saving as a whole. I would use John1024's solution then. Simpler. My solution could be used if there actually wasn't enough disk space and you couldn't split the file into all the pieces at once. – Brad Lanam Jan 07 '15 at 15:36
  • The situation when I don't have enough disk space is also possible, and in this case I'll have to limit the output to one file containing no more than predefined number of lines, so this answer is useful, thanks again. By the way, you can check my next question: [My .gz/.zip file contains a huge text file; without saving that file unpacked to disk, how to extract its lines that match a regular expression?](http://stackoverflow.com/questions/27834818/my-gz-zip-file-contains-a-huge-text-file-without-saving-that-file-unpacked-to) because I'd like to see a pure Bash solution, if it's possible – lyrically wicked Jan 08 '15 at 08:24
1

To extract a file from f.tar.gz and split it into files, each with no more than 1 million lines, use:

tar Oxzf f.tar.gz | split -l1000000

The above will name the output files by the default method. If you prefer the output files to be named prefix.nn where nn is a sequence number, then use:

tar Oxzf f.tar.gz | split -dl1000000 - prefix.

Under this approach:

  • The original file is never written to disk. tar reads from the .tar.gz file and pipes its contents to split which divides it up into pieces before writing the pieces to disk.

  • The .tar.gz file is read only once.

  • split, through its many options, has a great deal of flexibility.

Explanation

For the tar command:

  • O tells tar to send the output to stdout. This way we can pipe it to split without ever having to save the original file on disk.

  • x tells tar to extract the file (as opposed to, say, creating an archive).

  • z tells tar that the archive is in gzip format. On modern tars, this is optional.

  • f tells tar to use, as input, the file name specified.

For the split command:

  • -l tells split to split files limited by number of lines (as opposed to, say, bytes).

  • -d tells split to use numeric suffixes for the output files.

  • - tells split to get its input from stdin.
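A toy run makes the naming concrete (`seq` stands in for the tarred file; GNU split is assumed for -d):

```shell
# 25 lines, chunks of 10, numeric suffixes: GNU split writes
# prefix.00, prefix.01, prefix.02 (the last holding the 5 leftovers).
seq 25 | split -dl10 - prefix.
wc -l prefix.*
cat prefix.* > rejoined   # concatenating the pieces restores the stream
```

Because the suffixes sort lexically, `cat prefix.*` always reassembles the pieces in the original order.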

John1024
  • Is this supposed to extract that 20GB-sized single file to disk at first? If yes, this is not an answer. – lyrically wicked Jan 07 '15 at 06:33
  • @lyricallywicked __No, it does not.__ `tar` reads from the `.tar.gz` file and _pipes_ its contents to `split`. At no point is the whole file written to disk at once. – John1024 Jan 07 '15 at 06:36
  • Thank you! So, I'm going to test all the answers. This may take about a day! And I'll accept or comment, later – lyrically wicked Jan 07 '15 at 06:40
  • Nice one. I didn't know about `split` - this is like my answer but without writing any script at all! Do you happen to know which Unix systems have `split`? – John Zwinck Jan 07 '15 at 06:42
  • @JohnZwinck Thanks. There is also something new to discover in the Unix tool set. As for which systems support `split`, since `split` is part of [GNU Coreutils](http://www.gnu.org/software/coreutils/manual/coreutils.html#split-invocation), I expect that it will be on all varieties of Linux. [Mac OSX](https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/split.1.html) includes `split`. Also, `split` is required by [POSIX](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/split.html#tag_20_120), so it will be likely found on other systems as well. – John1024 Jan 08 '15 at 07:56
1

Here's a pure Bash solution for option #1, automatically splitting lines into multiple output files.

#!/usr/bin/env bash

set -eu

mkdir -p out            # make sure the output directory exists
filenum=1
chunksize=1000000
ii=0
: > out/file.$filenum   # truncate the first chunk in case it already exists

# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line
do
  if [ $ii -ge $chunksize ]
  then
    ii=0
    filenum=$(($filenum + 1))
    : > out/file.$filenum
  fi

  printf '%s\n' "$line" >> out/file.$filenum
  ii=$(($ii + 1))
done

This reads lines from stdin and creates files like out/file.1 with the first million lines, out/file.2 with the second million lines, etc. Then all you need to do is feed the input to the above script, like this:

tar xfzO big.tar.gz | ./split.sh

This will never save any intermediate file on disk, or even in memory. It is entirely a streaming solution. It's somewhat wasteful of time, but very efficient in terms of space. It's also very portable, and should work in shells other than Bash, and on ancient systems with little change.
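As the comments below note, the loop can also be pasted inline after the tar pipe instead of living in a separate split.sh. Here is a self-contained toy run of the same logic, with `seq 25` standing in for the tar output and a chunk size of 10 (both just demo values):

```shell
#!/usr/bin/env bash
# Toy run of the chunking loop: 25 input lines, chunks of 10.
# "seq 25" stands in for "tar xfzO big.tar.gz"; out/ names are as above.
mkdir -p out
seq 25 | {
  filenum=1
  chunksize=10
  ii=0
  : > out/file.$filenum              # start the first chunk empty
  while IFS= read -r line
  do
    if [ $ii -ge $chunksize ]; then
      ii=0
      filenum=$(($filenum + 1))
      : > out/file.$filenum
    fi
    printf '%s\n' "$line" >> out/file.$filenum
    ii=$(($ii + 1))
  done
}
wc -l out/file.*                     # expect 10, 10, 5 lines
```

The braces keep the loop in one subshell of the pipeline; only the output files survive, which is exactly the streaming behavior described above.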

John Zwinck
  • How should this solution look if I don't want to use an external `split.sh` file and I just want to paste the complete code to command-line? – lyrically wicked Jan 07 '15 at 06:53
  • Then put the `tar` right before the `while`, like this: `tar xfzO big.tar.gz | while read line`. – John Zwinck Jan 07 '15 at 06:54
  • please, check my next question: [My .gz/.zip file contains a huge text file; without saving that file unpacked to disk, how to extract its lines that match a regular expression?](http://stackoverflow.com/questions/27834818/my-gz-zip-file-contains-a-huge-text-file-without-saving-that-file-unpacked-to) – lyrically wicked Jan 08 '15 at 07:53
-2

You can use

sed -n 1,20p /Your/file/Path

where 1 and 20 are the first and last line numbers of the range you want. For example, to append that range to a file:

sed -n 1,20p /Your/file/Path >> file1

You can put the start and end line numbers in variables and use them accordingly.
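As written, this assumes the 20GB file is already on disk. Combined with tar's -O flag from the other answers, however, the same sed range can be applied to the stream, which does cover the question's option #2. A sketch (the archive and member names are placeholders, and a tiny archive is built here just for the demo):

```shell
# Build a tiny stand-in archive (30 lines) just to demonstrate;
# with the real archive, only the final pipeline is needed.
seq 30 > big.txt
tar czf f.tar.gz big.txt
rm big.txt

# Extract lines 11..20 of the archived file without ever saving
# the whole file: sed's "20q" quits as soon as the range is done.
tar xzOf f.tar.gz big.txt | sed -n '11,20p; 20q' > part2.txt
wc -l part2.txt   # expect 10
```

The early `q` also stops tar (via a broken pipe), so nothing past the requested range is decompressed in full.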

smn_onrocks
  • I don't understand. Is this supposed to extract a specific part of a .txt file from a .tar.gz archive? Where are the name of the archive, the name of the internal file, the name of an output file, and the path to save the output? I didn't test this answer, but I think it won't work. – lyrically wicked Jan 07 '15 at 06:21
  • First you unzip your tar.gz file to a path, and then you mention the unzipped file name 'abc.txt' like `sed -n 1,20p /Your/file/Path/abc.txt >> file1` – smn_onrocks Jan 07 '15 at 06:24
  • I said I can't unzip an archive - this will extract that 20GB-sized file as a whole, I need to extract a specific part of an internal file – lyrically wicked Jan 07 '15 at 06:27
  • using `tar -xvf` you should be able to unzip any size archive file. – smn_onrocks Jan 07 '15 at 06:29
  • The problem is I cannot save 20GB-sized file to disk – lyrically wicked Jan 07 '15 at 06:31
  • This would be a great answer to a different question, but this doesn't answer the question as posed. – CaffeineConnoisseur Sep 01 '16 at 23:58