
This question came to mind when I was trying to solve this problem.

I have a hard drive with a capacity of 120 GB, of which 100 GB is occupied by a single huge file, so 20 GB is still free.

My question is, how can we split this huge file into smaller ones, say 1 GB each? I can see that if I had ~100 GB of free space, it would probably be possible with a simple algorithm. But given only 20 GB of free space, we can write up to twenty 1 GB files. I have no idea how to delete contents from the bigger file while reading from it.

Any solution?

It seems I have to truncate the file by 1 GB once I finish writing one file, but that boils down to this question:

Is it possible to truncate a part of a file? How exactly?

I would like to see an algorithm (or an outline of an algorithm) that works in C or C++ (preferably Standard C and C++), so that I can learn the lower-level details. I'm not looking for a magic function, script, or command that can do this job.

Nawaz
  • you'd have to work from the **END** of the source file: split off a 1 GB chunk, truncate the source file by 1 GB, etc... You can't do it from the front, as that would require you to copy the entire file and you'd run out of space. – Marc B Apr 11 '13 at 03:21
  • @MarcB: Is it possible to truncate part of a file? How exactly? – Nawaz Apr 11 '13 at 03:22
  • sure, but it requires copying the "wanted" parts and skipping the unwanted parts. You cannot simply "delete" from the middle of a file. – Marc B Apr 11 '13 at 03:22
  • @MarcB: But how exactly do you truncate 1 GB, once you read 1GB part from it? – Nawaz Apr 11 '13 at 03:24
  • how about using the `truncate` Linux command? `truncate -s -100000000 your_big_file.data` (assuming you are working in Linux...) – Raptor Apr 11 '13 at 03:27
  • @ShivanRaptor: I've no idea of that command. How does it do what it does? Does it truncate a part of a file? Does it write a new file, deleting the old one? or what? – Nawaz Apr 11 '13 at 03:29
  • You know, `dd` and a bit of scripting would do a smashing job here. C++ can do this job, but it's massive overkill. Do you really want to compile a throw-away script? – tadman Apr 11 '13 at 03:30
  • @MarcB: How does `ftruncate()` work? Does it need more space? – Nawaz Apr 11 '13 at 03:30
  • @Nawaz: `truncate` is a Linux command that truncates a file to a specific length. (http://manpages.ubuntu.com/manpages/lucid/man2/truncate.2.html) – Raptor Apr 11 '13 at 03:30
  • @ShivanRaptor: You already said that it is Linux command. My question was more specific about the constraints which we have here. – Nawaz Apr 11 '13 at 03:31
  • @Nawaz: truncate makes the file shorter by removing the end. So you'd have to work backwards through the file. – rici Apr 11 '13 at 03:32
  • @nawaz: sorry, but I'd assumed that someone with 110k rep would be able to google for `ftruncate()` info themselves... – Marc B Apr 11 '13 at 03:33
  • You have to be careful when sticking only to standard functions. More often than not, I've had to resort to non-standard extensions to properly handle files > 4 GB. – Mysticial Apr 11 '13 at 03:37
  • Spend $50 on a bigger disk and use bsplit command. – brian beuning Apr 11 '13 at 03:38
  • @brianbeuning: One can increase the size of the bigger file and ask the same question, so your $50 cannot help there. – Nawaz Apr 11 '13 at 03:41
  • will you try merge sorting? – fatihk Apr 11 '13 at 05:24
  • @MarcB: it is not required to process from the END of the file, although doing so would be the most efficient. It is possible to process from the START of the file, or even the MIDDLE, without making an unnecessary copy that wastes disc space. Read data from the original file to the target file, up to 1GB, then copy the remaining data of the original file from the stopping point to the starting point overwriting the original data and then truncate the original file by the number of extracted bytes. No third file needed. It is slower, but it works when space is more limited than time. – Remy Lebeau Apr 11 '13 at 05:57
  • @RemyLebeau: I think when you attempt to overwrite the original file, reading from the offset 1GB, it will truncate the entire file. That means this will not work. – Nawaz Apr 11 '13 at 07:19
  • @MarcB: No, reading the bytes at offset 1GB+N and writing those bytes to offset 0+N of the same file will not truncate the file while reading/writing. – Remy Lebeau Apr 11 '13 at 07:57
  • @RemyLebeau: It will truncate the file. Please try it. – Nawaz Apr 11 '13 at 08:08
  • @Nawaz: if it's truncating it, you're opening it with the wrong flags. – Wug Apr 11 '13 at 09:03
  • @Wug: Then please tell me: what flags should I use? – Nawaz Apr 11 '13 at 09:29
  • @Nawaz: or opening it using multiple handles/streams that use conflicting flags. Open the file only once, for read/write non-truncating access, such as with `fopen("rb+")`. Seek to an offset, read some bytes, seek backwards, write the bytes, seek forwards to the next offset, read some more bytes, seek backwards, write the bytes, and so on. Then truncate the file only when finished. – Remy Lebeau Apr 11 '13 at 15:15
  • @remy: you're looking at it wrong. 100gig file -> 100x1gig chunks, without enough disk space to have the 100+100x1 = 200gig required to do it. so... read chunk 99->100 of the file, write out file "chunk100". chop 1gig off the end of the original file, so it's 99gig. then read chunk 98->99 into its own chunk, repeat. Total disk usage throughout, 101gig, never exceeding the limits. op doesn't want to overwrite anything, just split the original into multiple smaller chunks without exceeding the disk limit. – Marc B Apr 11 '13 at 16:31
  • @Marc: You just confirmed what I was describing, contradicting your earlier comment. You are the one who said "you can't do it from the front, as that's require you to copy the entire file and you'd run out of space", and that is not true at all. It is possible, and I described how to do it. You DO NOT need to copy the entire file in order to split the file from the front, or even the middle. The OP does want to "delete contents from the bigger file while reading from it", which involves overwriting the file's data inline. – Remy Lebeau Apr 11 '13 at 17:23
  • @remy: yes, but you're requiring double the file copies. copy block 96->97 to position 0, then copy block 0->1 to a new file. e.g. copy 200gig of data instead of 100gig. – Marc B Apr 11 '13 at 17:25
  • @MarcB: Copy data from positions 0-1GB to a chunk file, copy data from positions 1GB-EOF to position 0, truncate the file by 1GB, copy data from positions 0-1GB to a chunk file, copy data from positions 1GB-EOF to position 0, truncate the file by 1GB, and repeat as needed until all data is read. That is the only way to maintain the 101GB total HDD usage during the splitting process. I didn't say it was processing-efficient, just space-efficient. – Remy Lebeau Apr 11 '13 at 19:15
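
For what it's worth, a rough POSIX C sketch of one step of the front-of-file variant Remy Lebeau describes in the comments above might look like the following. The function name `extract_front_chunk` and the 1 MB buffer size are invented for illustration, and error handling is kept minimal:

```c
#define _FILE_OFFSET_BITS 64  /* request a 64-bit off_t on 32-bit glibc builds */
#include <fcntl.h>
#include <unistd.h>

/* Copy the first `chunk` bytes of fd into dst_name, shift the remaining
   bytes down to offset 0 in place, then truncate fd by `chunk` bytes. */
static int extract_front_chunk(int fd, const char *dst_name, off_t chunk)
{
    static char buf[1 << 20];  /* 1 MB scratch buffer */
    off_t size = lseek(fd, 0, SEEK_END);
    if (size < 0) return -1;
    if (chunk > size) chunk = size;

    int dst = open(dst_name, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (dst < 0) return -1;

    /* 1. Copy bytes [0, chunk) into the new chunk file. */
    for (off_t pos = 0; pos < chunk; ) {
        size_t want = (chunk - pos) < (off_t)sizeof buf
                      ? (size_t)(chunk - pos) : sizeof buf;
        ssize_t n = pread(fd, buf, want, pos);
        if (n <= 0 || write(dst, buf, (size_t)n) != n) { close(dst); return -1; }
        pos += n;
    }
    if (close(dst) != 0) return -1;

    /* 2. Shift bytes [chunk, size) down to offset 0, overwriting in place.
       Each block is read before its destination is written, so the
       overlapping ranges never corrupt data. */
    for (off_t pos = chunk; pos < size; ) {
        ssize_t n = pread(fd, buf, sizeof buf, pos);
        if (n <= 0 || pwrite(fd, buf, (size_t)n, pos - chunk) != n) return -1;
        pos += n;
    }

    /* 3. Cut off the now-duplicated tail. */
    return ftruncate(fd, size - chunk);
}
```

Calling this repeatedly until the source file is empty splits it front-to-back while never using more than one chunk of extra disk space, at the cost of rewriting the whole remaining tail after every extraction, roughly doubling the I/O compared with splitting from the end.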

2 Answers


There is no standard function for this job.

On Linux you can use the `ftruncate` function, while on Windows you can use `_chsize` or `SetEndOfFile`. A simple `#ifdef` will make it cross-platform. Also read this Q&A.
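
For illustration, a minimal sketch of such a wrapper might look like this (the name `set_file_length` is invented here, and real code would want proper error reporting):

```c
#ifdef _WIN32
#include <io.h>        /* _chsize_s */
/* Returns 0 on success, non-zero on failure. */
static int set_file_length(int fd, long long length)
{
    return _chsize_s(fd, length);
}
#else
#include <sys/types.h>
#include <unistd.h>    /* ftruncate */
static int set_file_length(int fd, long long length)
{
    return ftruncate(fd, (off_t)length);
}
#endif
```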

huysentruitw
  • I edited my question. Now it says *"I would like to see an algorithm (or an outline of an algorithm) that works in Standard C or C++, **so I may know the lower level details**. I don't want just some solution, scripts or command that can do this job."* – Nawaz Apr 11 '13 at 03:36
  • @Nawaz: There are no standard C++ facilities that allow you to do this reliably across platforms. You need to rely either on os-specific functions, like `_chsize` and `ftruncate`, or on third-party libraries, like `boost::filesystem`, which wrap those function calls. At least until `filesystem` is adopted into the standard, which should be soon. – Benjamin Lindley Apr 11 '13 at 03:49
  • @Nawaz: What do you mean by lower-level details? Do you want me to explain how you can open a handle to the hard drive and follow the file-cluster-chain yourself using a FAT32 library so you can truncate the chain yourself? If that is the case, then you are even further away from cross-platform and standards. – huysentruitw Apr 11 '13 at 08:29
  • @WouterHuysentruit: I meant if I were to implement `ftruncate` functionality myself, what would I be doing? What would be the algorithm? – Nawaz Apr 11 '13 at 08:51
  • Just follow the file-cluster-chain, modify the length of the cluster where the truncation should happen, and update the allocation table. Very simple, but file-system dependent. – huysentruitw Apr 24 '13 at 04:44

According to this question (Partially truncating a stream) you should be able to use, on a POSIX-compliant system, a call to `int ftruncate(int fildes, off_t length)` to resize an existing file.

Modern implementations will probably resize the file "in place" (though this is unspecified in the documentation). The only gotcha is that you may have to do some extra work to ensure that `off_t` is a 64-bit type (provisions exist within the POSIX standard for 32-bit `off_t` types).
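
On glibc-based Linux systems, for example, the usual way to request a 64-bit `off_t` in a 32-bit build is the `_FILE_OFFSET_BITS` feature-test macro, which must be defined before any system header is included:

```c
/* Must come before the first #include to take effect. */
#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <unistd.h>
```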

You should take steps to handle error conditions in case the call fails for some reason, since any serious failure could result in the loss of your 100 GB file.

Pseudocode (assume, and take steps to ensure, all data types are large enough to avoid overflows):

```
open (string filename)              // opens a file, returns a file descriptor
file_size (descriptor file)         // returns the absolute size of the specified file
seek (descriptor file, position p)  // moves the file position to the specified absolute offset
copy_to_new_file (descriptor file, string newname)
    // creates the file specified by newname, copies data from the specified
    // file descriptor into the new file until EOF is reached

set descriptor = open ("MyHugeFile")
set gigabyte = 2^30                 // 1024 * 1024 * 1024 bytes

set filesize = file_size(descriptor)
set blocks = (filesize + gigabyte - 1) / gigabyte

loop (i = blocks; i > 0; --i)
    set truncpos = gigabyte * (i - 1)
    seek (descriptor, truncpos)
    copy_to_new_file (descriptor, "MyHugeFile" + i)
    ftruncate (descriptor, truncpos)
```

Obviously some of this pseudocode is analogous to functions found in the standard library. In other cases, you will have to write your own.
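
As a concrete (and deliberately cautious) illustration, here is one way the above pseudocode might be fleshed out in POSIX C. The file name, chunk naming scheme, and 1 MB buffer size are arbitrary choices for the example, and you would want far more paranoid error handling before pointing it at the only copy of a 100 GB file:

```c
#define _FILE_OFFSET_BITS 64      /* request a 64-bit off_t on 32-bit glibc builds */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK ((off_t)1 << 30)    /* 1 GB */

int main(void)
{
    static char buf[1 << 20];     /* 1 MB copy buffer */
    int src = open("MyHugeFile", O_RDWR);
    if (src < 0) { perror("open"); return 1; }

    off_t size = lseek(src, 0, SEEK_END);
    off_t blocks = (size + CHUNK - 1) / CHUNK;

    for (off_t i = blocks; i > 0; --i) {
        off_t truncpos = CHUNK * (i - 1);
        char name[64];
        snprintf(name, sizeof name, "MyHugeFile.%lld", (long long)i);

        int dst = open(name, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (dst < 0) { perror("open chunk"); return 1; }

        /* Copy everything from truncpos to the current EOF into the chunk. */
        for (off_t pos = truncpos; pos < size; ) {
            ssize_t n = pread(src, buf, sizeof buf, pos);
            if (n <= 0) { perror("pread"); return 1; }
            if (write(dst, buf, (size_t)n) != n) { perror("write"); return 1; }
            pos += n;
        }
        if (close(dst) != 0) { perror("close chunk"); return 1; }

        /* Chop the copied tail off the source, freeing its disk blocks. */
        if (ftruncate(src, truncpos) != 0) { perror("ftruncate"); return 1; }
        size = truncpos;
    }
    return close(src) == 0 ? 0 : 1;
}
```

Each chunk is written out before the source is truncated, so peak disk usage stays around one chunk above the original file size, matching the 101 GB figure from the comments on the question.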

Wug
  • In addition to your answer, and if it's not already obvious, I would suggest that the developer does a TEST RUN on another system prior to the real deal to make sure there are no silly bugs in their code... When you only have one shot at it, you wanna make sure you don't miss. =) – paddy Apr 11 '13 at 04:20