How can I pack multiple files in a folder to one binary file and get the offset of every file?

Question

I have many files in a folder and I want to pack all of them into one binary file and get the offset of each file within it. Could someone help me? My code so far (very poor code; I am new to Python):

import os

# Walk the current folder and visit every .RTON file
for dirpath, dirnames, filenames in os.walk("."):
    for filename in filenames:
        if filename.endswith(".RTON"):
            # Join with dirpath so files in subfolders resolve correctly
            path = os.path.join(dirpath, filename)
            print(path, os.path.getsize(path))

For example: In my folder, I have 100 files and want to pack them together and get the offset of every file.

packed_file = file1 + file2 + file3...+ file100

The result should be a single binary file that contains the data of all the files packed into it. An analogy would be a tarball.

See: https://stackoverflow.com/questions/13613336/python-concatenate-text-files, and https://stackoverflow.com/questions/2104080/how-to-check-file-size-in-python — shay, Dec 12 '19 at 20:48
Sorry, this is no clearer than your [previous attempt](https://stackoverflow.com/questions/59310580/how-can-i-pack-many-files-to-one-binary-file-and-get-the-offset-of-every-file). You should really [edit] your question to add information, not just create a new post. — Martijn Pieters, Dec 12 '19 at 22:05
@MartijnPieters It looks clear to me, hence my answer. Maybe we're reading the question differently. Perhaps we could help OP. I'll make an edit based on what appears obvious the way I'm reading it. Maybe you can either suggest an improvement or ask for clarification about what is unclear to you, if you can put your finger on the problem. — Loduwijk, Dec 12 '19 at 23:03
@A770 Check out my suggested edit. It's not visible to everyone unless it is approved. You can approve it if you think it helps clarify your question. I assume I've captured your intent, but please clarify about any points on which I may be mistaken. — Loduwijk, Dec 12 '19 at 23:08
@A770 The same goes for my answer. Again, I assume I've understood your intent, but if my answer is mistaken then please clarify as requested by Martijn. — Loduwijk, Dec 12 '19 at 23:10
Thank you very much! @Loduwijk. I understand what you wrote. I didn't understand the code you wrote, I mean what is "read the first 4-byte integer"? Thanks in advance! :) — A770, Dec 14 '19 at 20:35
If you want to work with binary files as you suggested, then you need to study how data is represented and stored. 1) An integer is a "whole number" (1, 2, 5, 10, but not 1.5 and not 7.2). 2) 1 byte of computer information can only keep track of integers up to 255, so larger integers need more bytes (2 bytes for 65535, 3 bytes for 16-million, 4 bytes for 4-billion, etc.) 3) "the first 4-byte integer" is "the first 4 bytes of data in the file, which we will interpret as an integer, the number of files" — Loduwijk, Dec 15 '19 at 17:12

score 1 · Answer 1 · answered Dec 12 '19 at 22:05

This is actually a generic programming question, not specific to Python, so I will answer generically. This applies to practically all major languages.

The Simple File Table

A typical way to do this is to have the beginning of your created file contain a map of all the files within it.

Since you can write to (and later read from) a binary file in whatever way you choose, you can set this map up however you want. You just have to decide how you want the data to represent your file structure.

A relatively simple way is to use the first few bytes of the file as a counter (I'll call it N) that tells you how many source files the packed file contains, then use the next N*4 or N*8 bytes to store the location within the packed file where each source file can be found. After those 4+N*4 (or 8+N*8, or whatever) bytes, you put the files one at a time. If you want to include a filename, you can put the name at that location, just before the file data.
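To make "a 4-byte integer" concrete, here is a minimal Python sketch using the standard `struct` module. The format string `"<I"` (unsigned, little-endian) is my assumption; the layout above only fixes the width, not the byte order or signedness. `header_size` is a hypothetical helper name for illustration.

```python
import struct

# Assumption: unsigned 32-bit little-endian integers ("<I").
count_bytes = struct.pack("<I", 5)          # the counter N = 5 as 4 raw bytes
print(count_bytes)                          # b'\x05\x00\x00\x00'
print(struct.pack("<I", 300).hex())         # 2c010000 (300 does not fit in one byte)

# Reading it back: interpret 4 bytes as one integer.
(count,) = struct.unpack("<I", count_bytes)

def header_size(n_files, int_size=4):
    """Bytes used by the counter plus one location entry per file."""
    return int_size + n_files * int_size

print(header_size(5))      # 24
print(header_size(5, 8))   # 48
```

With 4-byte integers and 5 files, the file records can start at byte 24 at the earliest.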

A use case:

Suppose I have 5 source files that I want to store in one big packed file:

stuff/a/hi.txt (contains "Hello world!")
stuff/a/bye.txt (contains "Goodbye world!")
stuff/b/abc.txt (contains "abcdefg")
stuff/b/def.txt (contains "gfedcba")
stuff/b/ghi.txt (contains "mnopqrstuv")

I could save the data like this. (Note: the first number on each line below is the location within the file, i.e. how many bytes into the file that data starts. I'm assuming 4-byte integers, so each number takes up 4 bytes, and a text format where 1 letter = 1 byte, which might not be the case.)

0: 5 (# files: next 5*4 bytes used for file locations)
4: 24 (hi.txt location)
8: 50 (bye.txt location)
12: 79 (abc.txt location)
16: 101 (def.txt location)
20: 123 (ghi.txt location)
24: 6 (how many bytes are used for the following filename)
28: "hi.txt" (the filename data stored in the binary file)
34: 12 (file length)
38: "Hello world!" (The file's data packed into your packed-file)
50: 7 (file #2's filename length)
54: "bye.txt" (#2's filename)
61: 14 (file length)
65: "Goodbye world!"
79: 7 (#3 filename length)
83: "abc.txt"
90: 7 (data length)
94: "abcdefg"
101: 7 (#4 filename length)
105: "def.txt"
112: 7 (data length)
116: "gfedcba"
123: 7 (#5 filename length)
127: "ghi.txt"
134: 10 (data length)
138: "mnopqrstuv"
148: end of file, no more data (the last data byte is at location 147)
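The layout above can be produced with a short Python sketch. This is one possible reading of it, not a definitive implementation: the function name `pack`, the little-endian `"<I"` integer format, and UTF-8 filenames are all my assumptions.

```python
import struct

INT = struct.Struct("<I")  # assumption: 4-byte unsigned little-endian integers

def pack(files):
    """files: list of (name, data_bytes). Returns (packed_bytes, offsets)."""
    header_size = INT.size * (1 + len(files))   # counter + one location per file
    offsets = []
    body = bytearray()
    for name, data in files:
        offsets.append(header_size + len(body))  # where this file's record starts
        name_bytes = name.encode("utf-8")        # assumption: UTF-8 filenames
        body += INT.pack(len(name_bytes)) + name_bytes
        body += INT.pack(len(data)) + data
    header = INT.pack(len(files)) + b"".join(INT.pack(o) for o in offsets)
    return header + bytes(body), offsets

files = [
    ("hi.txt", b"Hello world!"),
    ("bye.txt", b"Goodbye world!"),
    ("abc.txt", b"abcdefg"),
    ("def.txt", b"gfedcba"),
    ("ghi.txt", b"mnopqrstuv"),
]
packed, offsets = pack(files)
print(offsets)      # [24, 50, 79, 101, 123]
print(len(packed))  # 148
```

For real files, the `data_bytes` values would come from opening each file in binary mode (`"rb"`) and reading it, and `packed` would be written out with a file opened in `"wb"` mode.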

To get a listing of what files are inside your packed file, you would simply do (pseudocode):

number-files = read the 4-byte integer at location 0
for N = 1..number-files:
    location[N] = read a 4-byte integer from location 4*N (entry N of the location table)
    name-length[N] = read a 4-byte integer from location[N]
    file-names[N] = read text string: name-length[N] bytes at location[N]+4
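In Python, the listing pseudocode might look like the sketch below. It assumes the whole packed file has already been read into a `bytes` object, that integers are little-endian unsigned (`"<I"`), and that names are UTF-8; `list_files` and `sample` are hypothetical names.

```python
import struct

def list_files(packed):
    """Return the filenames stored in a packed file using the simple file table."""
    (number_files,) = struct.unpack_from("<I", packed, 0)
    file_names = []
    for n in range(1, number_files + 1):
        # Entry n of the location table lives at byte 4*n.
        (location,) = struct.unpack_from("<I", packed, 4 * n)
        (name_length,) = struct.unpack_from("<I", packed, location)
        name_start = location + 4
        name = packed[name_start:name_start + name_length]
        file_names.append(name.decode("utf-8"))
    return file_names

# Hand-built two-file sample: the header is 4 + 2*4 = 12 bytes,
# so the two records start at bytes 12 and 38.
sample = (
    struct.pack("<III", 2, 12, 38)
    + struct.pack("<I", 6) + b"hi.txt" + struct.pack("<I", 12) + b"Hello world!"
    + struct.pack("<I", 7) + b"bye.txt" + struct.pack("<I", 14) + b"Goodbye world!"
)
print(list_files(sample))  # ['hi.txt', 'bye.txt']
```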

Then, to read a file's data, you would do (N is a number):

N = pick-a-file-any-file(file-names)
file-location = read integer at location 4*N
data-length-location = file-location + 4 + (read integer at location file-location)
data-length = read integer at data-length-location
data = read data-length bytes from location (data-length-location + 4)

And your file data for the chosen file will be in "data".
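The same lookup as a Python sketch, under the same assumptions as before (little-endian `"<I"` integers, the packed file held in memory as `bytes`, file numbers starting at 1); `read_file` and `sample` are hypothetical names.

```python
import struct

def read_file(packed, n):
    """Return the data of file number n (1-based) from the simple file table."""
    (file_location,) = struct.unpack_from("<I", packed, 4 * n)
    (name_length,) = struct.unpack_from("<I", packed, file_location)
    # Skip the name-length field and the name itself to reach the data length.
    length_location = file_location + 4 + name_length
    (data_length,) = struct.unpack_from("<I", packed, length_location)
    data_start = length_location + 4
    return packed[data_start:data_start + data_length]

# Hand-built two-file sample; records start at bytes 12 and 38.
sample = (
    struct.pack("<III", 2, 12, 38)
    + struct.pack("<I", 6) + b"hi.txt" + struct.pack("<I", 12) + b"Hello world!"
    + struct.pack("<I", 7) + b"bye.txt" + struct.pack("<I", 14) + b"Goodbye world!"
)
print(read_file(sample, 1))  # b'Hello world!'
print(read_file(sample, 2))  # b'Goodbye world!'
```

For a large packed file you would probably `seek()` to each location in an open file object instead of loading everything into memory; the arithmetic stays the same.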

Improvements

NOTE: This is not the most efficient way to handle a table of files. This is merely what I believe is probably the easiest to understand and to follow the logic.

There are a number of more efficient ways to handle the file structure. How you optimize your table depends on what you are optimizing for. If you need maximum speed in browsing a huge table of files, then you could have the following in the table:

number of files (N)
N pairs of (filename-location, file-data-location)
N filenames
N file-datas

This improves locality (a property of how close together similar pieces of data are in memory or on a disk) which can speed up access times, and it also makes it easier to go from having file number X to looking up either X's name or data.
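A rough Python sketch of that table-first layout follows. The answer does not say how name and data lengths are stored in this variant, so I am assuming each name and each data block keeps a 4-byte length prefix, as in the simple format; `pack_indexed` and `read_data` are hypothetical names, and file numbers are 0-based here.

```python
import struct

INT = struct.Struct("<I")  # assumption: 4-byte unsigned little-endian integers

def pack_indexed(files):
    """Layout: count, N (name-location, data-location) pairs, N names, N datas."""
    names = [name.encode("utf-8") for name, _ in files]
    pos = INT.size + len(files) * 2 * INT.size   # first byte after the table
    name_locs, data_locs = [], []
    for name_bytes in names:                     # all names sit together
        name_locs.append(pos)
        pos += INT.size + len(name_bytes)
    for _, data in files:                        # then all file data
        data_locs.append(pos)
        pos += INT.size + len(data)
    out = bytearray(INT.pack(len(files)))
    for name_loc, data_loc in zip(name_locs, data_locs):
        out += INT.pack(name_loc) + INT.pack(data_loc)
    for name_bytes in names:
        out += INT.pack(len(name_bytes)) + name_bytes
    for _, data in files:
        out += INT.pack(len(data)) + data
    return bytes(out)

def read_data(packed, x):
    """Go straight from file number x (0-based) to its data via the pair table."""
    pair_start = INT.size + x * 2 * INT.size
    (data_loc,) = INT.unpack_from(packed, pair_start + INT.size)
    (length,) = INT.unpack_from(packed, data_loc)
    return packed[data_loc + INT.size:data_loc + INT.size + length]

packed = pack_indexed([("abc.txt", b"abcdefg"), ("def.txt", b"gfedcba")])
print(len(packed))           # 64
print(read_data(packed, 1))  # b'gfedcba'
```

Note that `read_data` never touches the filename bytes, which is the point of keeping both locations in the table.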

Another improvement some systems use is to start each file's data only at a location divisible by 4096 (so file 1's data could be at location 4096, file 2's data at 8192, etc.). This can provide a boost because of the way data is read from disk.

Of course, if you have lots of small files then you don't want them all to have an alignment based on 4096, or your file could be thousands of times larger than necessary. So another improvement would be to have different sections in your file where some of them are 4096-aligned and some are not. In the use case above with the 5 small files, the biggest of which was 14 bytes, you would want them all in the same disk sector. But if you had any large files it would make sense to align them to a 4096-divisible location.

What do you do with the rest of the space between the end of one file and the beginning of the next (i.e. if you save "abcdefg" at 4096 and "gfedcba" at 8192, what do you do with the bytes from 4103 to 8191)? That's dead space; you can just set it all to zeros. Yes, it gets wasted, which is why you don't want to do this for small files, only large files.
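The alignment and zero-padding arithmetic can be sketched in Python as follows. The helper names `align_up`, `padding_needed`, and `place_aligned` are hypothetical, and the 24-byte header in the example is only a stand-in.

```python
ALIGNMENT = 4096

def align_up(position, alignment=ALIGNMENT):
    """Smallest multiple of alignment that is >= position."""
    return (position + alignment - 1) // alignment * alignment

def padding_needed(position, alignment=ALIGNMENT):
    """Number of zero bytes to write so the next write lands on a boundary."""
    return align_up(position, alignment) - position

def place_aligned(blobs, header_size):
    """Lay out blobs after a zero-filled header, each on a 4096-byte boundary."""
    out = bytearray(header_size)                # placeholder header, all zeros
    locations = []
    for blob in blobs:
        out += bytes(padding_needed(len(out)))  # dead space, filled with zeros
        locations.append(len(out))
        out += blob
    return bytes(out), locations

buffer, locations = place_aligned([b"abcdefg", b"gfedcba"], header_size=24)
print(locations)                               # [4096, 8192]
print(buffer[4103:8192] == bytes(8192 - 4103)) # True: the gap is all zeros
```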

Updating the file

Let's say you've written the file, then you change abc.txt on your computer to contain "abcdefghijklmnop" and you want to update your packed-file that you've created. Now what? "abcdefghijklmnop" does not fit in the space you reserved for it in the packed-file, so you can't just insert it.

If you want to edit the file in place, then you need to save the data in a way that makes it easy to accommodate changes like this. For example, if we had aligned file data to 4096-divisible locations like mentioned earlier, then we would have plenty of dead space left over, plenty of space to accommodate this change.

Alternatively, you could rearrange some of the data to accommodate it. You could move the next file to the end of the file and use some of its space. This is getting complicated fast, isn't it?

An easy way to update the file is to just recreate it every time you update it; don't even bother trying to change a small part of the file, just overwrite the entire thing every time. For huge file structures, this could take a lot of time, but for small ones that don't need to scale well it works.
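A minimal Python sketch of the "just recreate it" approach. Writing to a temporary file and then swapping it in with `os.replace` is my own addition, so a crash mid-write does not leave a half-written packed file; `pack` and `rewrite_packed_file` are hypothetical names, and `pack` assumes the simple file table with little-endian 4-byte integers.

```python
import os
import struct
import tempfile

INT = struct.Struct("<I")  # assumption: 4-byte unsigned little-endian integers

def pack(files):
    """Build the simple file table layout for a list of (name, data_bytes)."""
    header_size = INT.size * (1 + len(files))
    offsets, body = [], bytearray()
    for name, data in files:
        offsets.append(header_size + len(body))
        name_bytes = name.encode("utf-8")
        body += INT.pack(len(name_bytes)) + name_bytes + INT.pack(len(data)) + data
    header = INT.pack(len(files)) + b"".join(INT.pack(o) for o in offsets)
    return header + bytes(body)

def rewrite_packed_file(path, files):
    """Regenerate the whole packed file, then swap it in over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(pack(files))
    os.replace(tmp_path, path)

with tempfile.TemporaryDirectory() as folder:
    packed_path = os.path.join(folder, "packed.bin")
    rewrite_packed_file(packed_path, [("abc.txt", b"abcdefg")])
    # abc.txt grew, so rebuild everything instead of patching in place.
    rewrite_packed_file(packed_path, [("abc.txt", b"abcdefghijklmnop")])
    print(os.path.getsize(packed_path))  # 39
```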

Defragmenting

In the previous section, if you chose to go the hard way and update the file structure only in the places where it has changed instead of rewriting the entire thing, then good for you, but it gets worse still...

If you move things around in your file enough, file data could eventually be scattered all over the place with a lot of dead space in between. This can eventually produce a lot of overhead.

If you want to continue optimizing down that path, the next step would be to either improve your algorithm for moving things around, or implement a defragmenting algorithm that will fix the file's inefficiencies.
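As a very rough idea of what a defragmenting pass does, the sketch below copies the live file data back-to-back and drops the dead space. It is a simplification under my own assumptions: the caller already knows each file's `(location, length)`, and the location table in the header still has to be rewritten afterwards with the returned locations. `compact` is a hypothetical name.

```python
def compact(buffer, extents, header_size):
    """Copy live data back-to-back after the header, dropping dead space.

    extents: list of (location, length), one per file, in table order.
    Returns (new_buffer, new_extents); the header bytes are copied unchanged,
    so the caller must still update the stored locations.
    """
    out = bytearray(buffer[:header_size])
    new_extents = []
    for location, length in extents:
        new_extents.append((len(out), length))
        out += buffer[location:location + length]
    return bytes(out), new_extents

# A fragmented buffer: 4-byte header, then data separated by dead zero bytes.
fragmented = b"HDR!" + bytes(4) + b"bbb" + bytes(5) + b"aa"
compacted, new_extents = compact(fragmented, [(16, 2), (8, 3)], header_size=4)
print(compacted)    # b'HDR!aabbb'
print(new_extents)  # [(4, 2), (6, 3)]
```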

As you can see, optimizing down this path just gets to be more and more of a headache. So generally it's not done if it's not necessary.

Summary

You need to decide what your needs are, which depends on why you are doing this. If you just have some small project, like a personal project or a class homework, then something like the simple method in the first section is sufficient, possibly with a couple small optimizations if you grasp it well enough, and you can just rewrite the packed-file every time you update it.

This post was rightfully closed as too broad, this is a *very long answer* for such an unclear question. — Martijn Pieters, Dec 12 '19 at 22:07
@MartijnPieters I disagree about it being too broad; it appears to me to ask a very specific question. You are free to disagree - I'm just glad I clicked submit seconds before it got closed, as it would be frustrating to put that effort in for nothing. About it being very long: That's because it's very thorough, including even a byte-by-byte explanation of a sample file. The extra sections are just icing to show that there's a lot involved in real software-engineering situations. And if you omitted the bullet list the main section isn't very long at all. — Loduwijk, Dec 12 '19 at 22:56