This is actually a generic programming question, not specific to Python, so I will answer generically. The same approach applies in practically all major languages.
The Simple File Table
A typical way to do this is to have the beginning of your created file contain a map of all the files within it.
Since you can write to (and later read from) a binary file in whatever way you choose, you can set this map up however you want. You just have to decide how you want the data to represent your file structure.
A relatively simple way is to use the first few bytes of the file as a counter (I'll call it N) that tells you how many source files are contained within your packed file, then use the next N*4 (or N*8) bytes to store the locations within your packed file where each source file can be found. After those 4+N*4 (or 8+N*8, or whatever) header bytes, you put the files themselves one at a time. If you want to include a filename, put the name at that location just before the file's data.
A use case:
I have 5 source-files I want to store into your big packed-file:
- stuff/a/hi.txt (contains "Hello world!")
- stuff/a/bye.txt (contains "Goodbye world!")
- stuff/b/abc.txt (contains "abcdefg")
- stuff/b/def.txt (contains "gfedcba")
- stuff/b/ghi.txt (contains "mnopqrstuv")
I could save the data like this:
(Note: the first number in each bullet point below is the location within the file, i.e. how many bytes into the file that data sits. I'm assuming 4-byte integers, so each number takes up 4 bytes, and a 1-character = 1-byte text encoding, which might not be the case for you. A Python sketch that writes this exact layout follows the listing.)
- 0: 5 (# files: next 5*4 bytes used for file locations)
- 4: 24 (hi.txt location)
- 8: 50 (bye.txt location)
- 12: 79 (abc.txt location)
- 16: 101 (def.txt location)
- 20: 123 (ghi.txt location)
- 24: 6 (how many bytes are used for the following filename)
- 28: "hi.txt" (the filename data stored in the binary file)
- 34: 12 (file length)
- 38: "Hello world!" (The file's data packed into your packed-file)
- 50: 7 (file #2's filename length)
- 54: "bye.txt" (#2's filename)
- 61: 14 (file length)
- 65: "Goodbye world!"
- 79: 7 (#3 filename length)
- 83: "abc.txt"
- 90: 7 (data length)
- 94: "abcdefg"
- 101: 7 (#4 filename length)
- 105: "def.txt"
- 112: 7 (data length)
- 116: "gfedcba"
- 123: 7 (#5 filename length)
- 127: "ghi.txt"
- 134: 10 (data length)
- 138: "mnopqrstuv"
- 148: no more data; byte 147 was the last byte of the last file
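If it helps to see this concretely, here is a rough Python sketch that writes exactly the layout above. Everything in it is a choice rather than a standard: 4-byte little-endian integers via struct, UTF-8 filenames (just the base names, as in the byte listing), and a made-up helper name pack_files.

```python
import struct

def pack_files(packed_path, files):
    """files: a list of (filename, data-bytes) pairs."""
    names = [name.encode("utf-8") for name, _ in files]
    datas = [data for _, data in files]

    # Header: 4 bytes for the file count, then 4 bytes per file location.
    offset = 4 + 4 * len(files)
    locations = []
    for name, data in zip(names, datas):
        locations.append(offset)
        # Each entry is: name length, name, data length, data.
        offset += 4 + len(name) + 4 + len(data)

    with open(packed_path, "wb") as out:
        out.write(struct.pack("<I", len(files)))   # number of files
        for loc in locations:                      # the file-location table
            out.write(struct.pack("<I", loc))
        for name, data in zip(names, datas):       # the entries themselves
            out.write(struct.pack("<I", len(name)))
            out.write(name)
            out.write(struct.pack("<I", len(data)))
            out.write(data)

pack_files("packed.bin", [
    ("hi.txt",  b"Hello world!"),
    ("bye.txt", b"Goodbye world!"),
    ("abc.txt", b"abcdefg"),
    ("def.txt", b"gfedcba"),
    ("ghi.txt", b"mnopqrstuv"),
])
```

Running this produces the same offsets as the bullet list: the five entries start at bytes 24, 50, 79, 101, and 123.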
To get a listing of what files are inside your packed file, you would simply do (pseudocode):
number-files = read the first 4-byte integer
for each N in 1..number-files:
location[N] = read a 4-byte integer from location N*4
name-length[N] = read a 4-byte integer from location location[N]
file-names[N] = read text string: name-length[N] bytes at location[N]+4
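In Python, that listing step might look roughly like this (same assumptions as the writing sketch above: 4-byte little-endian integers and UTF-8 filenames):

```python
import struct

def list_files(packed_path):
    with open(packed_path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))
        # The location table: one 4-byte offset per file, starting at byte 4.
        locations = [struct.unpack("<I", f.read(4))[0] for _ in range(count)]
        names = []
        for loc in locations:
            f.seek(loc)
            (name_len,) = struct.unpack("<I", f.read(4))
            names.append(f.read(name_len).decode("utf-8"))
    return locations, names
```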
Then, to read a file's data, you would do (N is a number):
N = pick-a-file-any-file(file-names)
file-location = read integer at location N*4
file-data-location = file-location + 4 + (read integer at location file-location)
data-length = read integer at file-data-location
data = read data-length bytes from location (file-data-location + 4)
And your file data for the chosen file will be in "data".
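The read step in Python, continuing the same assumptions (locations is the list returned by the hypothetical list_files above, and index is the position of the file you picked):

```python
import struct

def read_file(packed_path, index, locations):
    """index: the 0-based position of the file chosen from the listing."""
    with open(packed_path, "rb") as f:
        f.seek(locations[index])
        (name_len,) = struct.unpack("<I", f.read(4))
        f.seek(name_len, 1)                  # skip over the filename
        (data_len,) = struct.unpack("<I", f.read(4))
        return f.read(data_len)
```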
Improvements
NOTE: This is not the most efficient way to handle a table of files. It is merely what I believe is the easiest version to understand and follow.
There are a number of more efficient ways to handle the file structure. How you optimize your table depends on what you are optimizing for. If you need maximum speed in browsing a huge table of files, then you could have the following in the table:
- number of files (N)
- N pairs of (filename-location, file-data-location)
- N filenames
- N file-datas
This improves locality (a measure of how close together related pieces of data sit in memory or on disk), which can speed up access times, and it also makes it easier to go from file number X to either X's name or X's data.
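A rough sketch of a writer for that reorganized layout, under the same assumptions as before (I've kept a 4-byte length field in front of each name and each data blob, which the list above doesn't spell out):

```python
import struct

def pack_files_grouped(packed_path, files):
    names = [name.encode("utf-8") for name, _ in files]
    datas = [data for _, data in files]

    # 4 bytes for the count, then N pairs of (filename-location, file-data-location).
    offset = 4 + 8 * len(files)
    name_locs, data_locs = [], []
    for name in names:
        name_locs.append(offset)
        offset += 4 + len(name)
    for data in datas:
        data_locs.append(offset)
        offset += 4 + len(data)

    with open(packed_path, "wb") as out:
        out.write(struct.pack("<I", len(files)))
        for nloc, dloc in zip(name_locs, data_locs):
            out.write(struct.pack("<II", nloc, dloc))
        for name in names:                    # all the filenames together...
            out.write(struct.pack("<I", len(name)))
            out.write(name)
        for data in datas:                    # ...then all the file data together
            out.write(struct.pack("<I", len(data)))
            out.write(data)
```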
Another improvement some systems use is to only start a file's data at a location divisible by 4096: file 1's data could be at location 4096, file 2's data at 8192, etc. This can provide a boost because disks read and write data in block-sized chunks (commonly 4096 bytes).
Of course, if you have lots of small files then you don't want them all aligned to 4096-byte boundaries, or your file could be thousands of times larger than necessary. So another improvement would be to have different sections in your file, some of them 4096-aligned and some not. In the use case above with the 5 small files, the biggest of which was 10 bytes, you would want them all in the same disk sector. But if you had any large files it would make sense to align them to a 4096-divisible location.
What do you do with the space between the end of one file and the beginning of the next (i.e. if you save "abcdefg" at 4096 and "gfedcba" at 8192, what do you do with the bytes from 4103 to 8191)? That's dead space; you can just set it all to zeros. Yes, it gets wasted, which is why you don't want to do this for small files, only large ones.
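The alignment itself is just arithmetic: round an offset up to the next multiple of 4096 and zero-fill the gap. A minimal sketch:

```python
SECTOR = 4096

def align_up(offset, alignment=SECTOR):
    # Round offset up to the next multiple of `alignment`.
    return (offset + alignment - 1) // alignment * alignment

def write_aligned(out, data, alignment=SECTOR):
    # Zero-fill the dead space, then write the data at an aligned offset.
    padding = align_up(out.tell(), alignment) - out.tell()
    out.write(b"\x00" * padding)
    return out.write(data)
```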
Updating the file
Let's say you've written the file, then you change abc.txt on your computer to contain "abcdefghijklmnop" and you want to update your packed-file that you've created. Now what? "abcdefghijklmnop" does not fit in the space you reserved for it in the packed-file, so you can't just insert it.
If you want to edit the file in place, then you need to save the data in a way that makes it easy to accommodate changes like this. For example, if we had aligned file data to 4096-divisible locations like mentioned earlier, then we would have plenty of dead space left over, plenty of space to accommodate this change.
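For example, if each file's data lives in its own 4096-aligned slot and the new contents still fit inside that slot, an in-place update can be as simple as seeking there and overwriting. A sketch, assuming the length-prefixed entries from earlier and a slot_size you reserved when packing:

```python
import struct

def update_in_place(packed_path, data_location, slot_size, new_data):
    """Overwrite one file's (length, data) entry inside its reserved slot."""
    if 4 + len(new_data) > slot_size:
        raise ValueError("new data does not fit in the reserved slot")
    with open(packed_path, "r+b") as f:
        f.seek(data_location)
        f.write(struct.pack("<I", len(new_data)))
        f.write(new_data)
        # Zero out whatever is left of the old contents in this slot.
        f.write(b"\x00" * (slot_size - 4 - len(new_data)))
```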
Alternatively, you could rearrange some of the data to accommodate it. You could move the next file to the end of the file and use some of its space. This is getting complicated fast, isn't it?
An easy way to update the file is to just recreate it every time you update it; don't even bother trying to change a small part of the file, just overwrite the entire thing every time. For huge file structures, this could take a lot of time, but for small ones that don't need to scale well it works.
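With the helpers sketched earlier (all hypothetical names), the "just rewrite everything" approach is only a few lines: read every entry back out, swap in the changed one, and pack the whole file again.

```python
def update_by_rewrite(packed_path, filename, new_data):
    locations, names = list_files(packed_path)              # from the listing sketch
    files = [(name, read_file(packed_path, i, locations))   # read every entry back out
             for i, name in enumerate(names)]
    files = [(name, new_data if name == filename else data) for name, data in files]
    pack_files(packed_path, files)                           # rewrite the whole thing
```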
Defragmenting
For the previous section, if you chose to go the hard way and update the file structure only in the places where it has changed, instead of rewriting the entire thing, then good for you, but it gets worse still...
If you move things around in your file enough, eventually file data ends up scattered all over the place, with chunks of dead space in between. This can eventually produce a lot of overhead.
If you want to continue optimizing down that path, the next step would be to either improve your algorithm for moving things around, or implement a defragmenting algorithm that will fix the file's inefficiencies.
As you can see, optimizing down this path just gets to be more and more of a headache. So generally it's not done if it's not necessary.
Summary
You need to decide what your needs are, which depends on why you are doing this. If you have a small project, like a personal project or a class assignment, then something like the simple method in the first section is sufficient, possibly with a couple of small optimizations once you grasp it well enough, and you can just rewrite the packed file every time you update it.