
I have lots of small files. To save file handles and improve IO efficiency, these files are packed into one big file. However, these small files must also be updatable at runtime. So updating and reading different parts of a single big file at the same time, from different threads, is required.

Because of the memory limit, mmap is not a good choice, so I have to implement this myself. But I'm concerned about whether it is safe to read and write different parts of a single file at the same time on iOS/Android. How can I make sure a block that is being written is not read by another thread?

Should I implement the whole feature with thread locks, or is there already a mature technique for this kind of work?

By the way, I use C++ for my project, but Java & Obj-C are also options.

Use case example:

My project is an RPG game. When a player sees an item that is not stored in the original package, the game will load it from the server and save it to disk automatically and immediately.

Each item corresponds to a single file. Each file is roughly 300KB~1.5MB, and there are 3000~5000 items on the server. In the worst case, a player will save thousands of files locally.

The good thing is that my users can load items on demand to save storage, and when updating, only changed items are redownloaded. But thousands of files lead to a high risk of running out of FDs or other resources.

That's why I would like to pack these small files into a single big package file, while still keeping the ability to update/add a single file.

Yyao
  • If you're set on using this approach, yep, locks are still a thing in C. But you don't even mention the language you're using, so I'm going to assume `lseek` to make the file handle jump around. Have you considered using a database to organize your data? Are you trying to optimize before measuring how slow it is and knowing whether it will really impact your performance? Isn't mmap's limit 4GB, and in that case are you really planning on having such a big file on disk? And why not use the C API, available on both iOS and Android? – Fabio Jul 07 '20 at 00:14
  • @Fabio Thank you for replying. I haven't considered using a database to manage my data. My data consists of 3000~5000 small files with an average size of 500KB. I have no experience dealing with binary files in a database. Does it fit my case? – Yyao Jul 09 '20 at 06:58
  • @Fabio For mmap, my project is memory-constrained. There is only a 10~20M memory budget for this feature. I thought mmap would take the same amount of memory as the file size on disk. I'm planning to make 3~4 500MB files to handle these small files; in the worst case, that would take 2GB of memory. So, basically, if I use the C API and make sure threads never read/write the same block of a file, will my project work fine? – Yyao Jul 09 '20 at 06:58
  • how are you packing them? with zip? – M D P Jul 09 '20 at 07:41
  • that's not how mmap backed by a file works; it's virtual memory, so it doesn't really use RAM the way you'd think. Now the bad news: iOS won't let you mmap over 700 MB https://stackoverflow.com/questions/13425558/why-does-mmap-fail-on-ios. Now, on architecting a solution for your problem, please edit your question with what you expect your write/read frequency to be, and whether there's user interaction like tapping a button and expecting a particular binary (aka blob) to immediately load something on the screen. There are many options, and the user interaction may drive the best solution. – Fabio Jul 09 '20 at 07:45
  • @MDP These files are not packed into one file yet in my project, and there is a high risk of running out of FDs or hurting performance. That's why I want to pack them into one file. But I still want to keep the ability to update or add a file after packing, and to read other files in the package simultaneously. – Yyao Jul 09 '20 at 12:12
  • @Fabio I have updated some detail of my use case. Hope I made it clear : ) – Yyao Jul 09 '20 at 12:33
  • Well you can do all that through zip api – M D P Jul 09 '20 at 23:09
  • @MDP If my files were read-only, zip would be a great choice. But I'm not sure: if I modify a file in the zip and write it back to disk, will that lead to rearranging all the files in the zip again? – Yyao Jul 10 '20 at 01:24
  • No you don't have to repack it. see this answer: https://stackoverflow.com/a/17504151/2855059 – M D P Jul 11 '20 at 11:24

2 Answers


In short, yes: locks are still the best way to handle that, and they will remain an important tool in a dev's tool belt.

This kind of problem is as common as the approaches to solve it, which almost makes this answer opinion-based. I'll sprinkle my opinions here and there, but you will need to chip in your own decisions based on what's best or easier for you.

First of all, managing a huge variable-size file, with many little things of variable size inside it being deleted and created on the fly, from multiple threads, seems to me as complex as designing and implementing a file system. And I see no advantages compared to the approaches below - well, maybe it will be blazing fast. But trust me, you neither need nor want to go that route.

So I won't exactly answer your original question, instead I'd like to show you a less risky way to go around your problem.

For practical purposes I'll refer to the game items as assets. I'll also assume these assets are not meant to be used directly by the GPU, such as textures, which may need a fresh take that I'm not experienced in.

=========

1- Network cache approach

  • find a library that caches network requests.
  • every time you need an asset, you pretend you're getting it from the network, and the library gives you a binary. If it's the first time, it will fetch it from the server; otherwise it's likely to find a copy in the library's cache.

ups: very simple and quick to set up. You configure a cache size and old objects are evicted based on LRU (least recently used). If the server is set up properly, your app knows whether it has the latest version of the asset or there is a new one to download. And there's no need to care about locks and thread safety.

downs: can be very inefficient if you set up the cache strategy wrong or your server doesn't expose the caching headers correctly.

For this approach I can suggest OkHttp version 4, which is written in Kotlin. That means you can have it running on Android or iOS, it should be relatively easy to interface with from C/C++/Obj-C (although I haven't tried it personally), and it's trivial from Java.

There are certainly other libs around, but I don't know another one that can be used from both C and Java/JVM.
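
Whatever library you end up with, the game-facing interface can stay the same. A minimal sketch of what such a wrapper could look like in C++ - the `AssetProvider`/`fetch` names are made up for illustration, not part of any particular library:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical game-facing interface: callers always "ask the network",
// and the implementation decides whether to serve the bytes from its
// cache or download them from the server.
class AssetProvider {
public:
    using Bytes = std::vector<std::uint8_t>;
    using Callback = std::function<void(bool ok, Bytes data)>;

    virtual ~AssetProvider() = default;

    // Asynchronously fetch one asset by id; the callback runs when the
    // bytes are available (from cache or freshly downloaded).
    virtual void fetch(const std::string& assetId, Callback onDone) = 0;
};
```

The implementation behind it could be OkHttp reached via JNI, or any other HTTP client that honours the server's caching headers.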

=========

2- Track individual assets separately

You may need a central class that determines whether an asset is available, not available, or downloading. You'll need it to eventually check for newer versions, and eventually to delete a few of them to save space.

That's a lot of info to have in mind for each asset. I feel like the natural approach is to have a database for the purpose of tracking such state.

Now you have 2 options. You can store the asset in the database as a blob, or generate a unique filename, save the asset yourself to disk, and store the filename in the database. I strongly suggest the latter; it will make your debugging much easier and is way less risky.
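
To make the second option concrete, here is a minimal sketch against the SQLite C API - the `assets` table and its columns are my own invention, and the upsert syntax assumes SQLite 3.24+:

```cpp
#include <sqlite3.h>
#include <string>

// Assumes a table created once with:
//   CREATE TABLE assets(id TEXT PRIMARY KEY, filename TEXT, version INTEGER);
// Inserts or updates one asset's on-disk filename and version.
// Error handling is reduced to a bool for brevity.
bool upsert_asset(sqlite3* db, const std::string& assetId,
                  const std::string& filename, int version) {
    const char* sql =
        "INSERT INTO assets(id, filename, version) VALUES(?1, ?2, ?3) "
        "ON CONFLICT(id) DO UPDATE SET filename = ?2, version = ?3;";
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) return false;
    sqlite3_bind_text(stmt, 1, assetId.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, filename.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_int(stmt, 3, version);
    const bool ok = (sqlite3_step(stmt) == SQLITE_DONE);
    sqlite3_finalize(stmt);
    return ok;
}
```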

Alternatively, you can have a class that is created when the app starts, scans the available files and versions, and holds all that info in memory.

ups: you store each asset individually, either as a file on disk or as a blob. You can keep track of how many times you used each one and come up with strategies to delete them if you want to.

downs: choosing a database can take a long time. In particular, SQLite and RealmDB work on both Android and iOS, so you can potentially share some code.

While reading for this answer I found a very interesting article claiming that on some OSs (including Android), reading small stored blobs (around 10kB) from SQLite is faster than reading them from disk. An interesting surprise, but only marginally faster, so it's not worth doing just for this gain, since reading multiple blobs in parallel may create a bottleneck on the db. https://www.sqlite.org/fasterthanfs.html

You only need as many file descriptors as there are assets currently being read from disk. After that, you can keep the data in memory and close the fd.

===============

3- Network cache, but with an in-memory cache

So this is an optimisation on top of (1) in case something gets too slow. But as with all performance optimisations, I strongly suggest you measure before spending time on it, so that in the end you KNOW how much time you saved and whether it's worth the extra maintenance after you're done and have forgotten how it works.

Here you roll up a class that can hold, say, 50 assets in memory for very fast access. When it doesn't have the asset, it asks the network library for it.

ups: it's more performant than (1) and less complex than (2).

downs: it's still more complex than (1).
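
A minimal sketch of such an in-memory cache in C++, using the classic list + hash-map LRU layout - the class and method names are placeholders, and it is not thread-safe on its own, so wrap calls in a lock if several threads share it:

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

// Holds the most recently used assets in memory; evicts the oldest
// entry once the configured capacity is exceeded.
class AssetMemoryCache {
public:
    using Bytes = std::vector<std::uint8_t>;

    explicit AssetMemoryCache(std::size_t capacity) : capacity_(capacity) {}

    const Bytes* get(const std::string& id) {
        auto it = index_.find(id);
        if (it == index_.end()) return nullptr;        // miss: ask the network layer
        lru_.splice(lru_.begin(), lru_, it->second);   // mark as most recently used
        return &it->second->second;
    }

    void put(const std::string& id, Bytes data) {
        auto it = index_.find(id);
        if (it != index_.end()) {                      // refresh an existing entry
            it->second->second = std::move(data);
            lru_.splice(lru_.begin(), lru_, it->second);
            return;
        }
        lru_.emplace_front(id, std::move(data));
        index_[id] = lru_.begin();
        if (lru_.size() > capacity_) {                 // evict the least recently used
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
    }

private:
    std::size_t capacity_;
    std::list<std::pair<std::string, Bytes>> lru_;     // front = most recent
    std::unordered_map<std::string,
        std::list<std::pair<std::string, Bytes>>::iterator> index_;
};
```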

================

1001 - big file and mmap

Why did I number this option 1001? Because the options are in the order I'd recommend them, and I'd really not recommend this approach.

I've used mmap many years ago, so I hope I remember its details correctly. At best they apply only to the single-core Linux machine where I used it, so please verify that you get the same behavior on the platforms you need.

If you create a 1GB file and mmap it, you're not going to consume 1GB of RAM, since that's only virtual memory. It consumes physical memory proportional to the number of pages faulted in as you read/write the file.
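
For illustration, a minimal POSIX sketch of creating and mapping a pre-sized package file - the function name and parameters are made up, and error handling is stripped down:

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a pre-sized package file into the address space. Physical memory is
// only consumed page by page as regions of the mapping are actually touched.
void* map_package(const char* path, std::size_t size) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    ftruncate(fd, static_cast<off_t>(size));   // reserve the file size on disk
    void* base = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                 // the mapping stays valid after close
    return base == MAP_FAILED ? nullptr : base;
}
```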

You don't need any locks to read or write to an mmapped file. Simply read and write to it, and the next read reflects the last write. Now, I did this back in 2004 on old single-core CPUs. How does it behave on modern multi-core CPUs, and how do you ensure that after core 1 writes to a memory position (aka file region) you read the same value on core 2 instead of the previously written value? I have no idea, and I urge you not to implement this without learning that first.

You WILL need locks/semaphores and thread safety for the algorithm that allocates an offset for each asset. When your game asks for an asset, you need to determine whether you have it on disk, which also implies knowing where on disk it is; let's call this "where" the offset. And if you don't have it, you need to decide where to store it, download it, and record that offset somewhere. That's the bit of your code that is prone to race conditions.
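
That allocator is essentially a table of asset id -> (offset, size) plus a bump pointer for new data, all guarded by a lock. A minimal sketch, assuming assets are only ever appended and never resized in place (all names are placeholders):

```cpp
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

struct Extent {
    std::uint64_t offset;   // where the asset lives inside the big file
    std::uint64_t size;
};

// Thread-safe bookkeeping for where each asset is stored in the package.
class OffsetTable {
public:
    // Look up an asset's location, if we already have it on disk.
    std::optional<Extent> find(const std::string& id) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = extents_.find(id);
        if (it == extents_.end()) return std::nullopt;
        return it->second;
    }

    // Reserve space at the end of the file for a new or updated asset.
    Extent allocate(const std::string& id, std::uint64_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        Extent e{next_free_, size};
        next_free_ += size;
        extents_[id] = e;
        return e;
    }

private:
    std::mutex mutex_;
    std::uint64_t next_free_ = 0;
    std::unordered_map<std::string, Extent> extents_;
};
```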

ups: fast. But I'm not really sure how much faster than the previous approaches. If you need an asset for the first time, you still need to wait for a page fault, which will read that file region from disk and load it into physical memory.

downs: managing memory offsets and synchronizing page faults across cores will make you a better programmer, at the cost of a lot of time and tears. And from my experience I'm pretty sure something weird is going to happen on either iOS or Android that doesn't behave as expected. Like Why does mmap fail on iOS?

https://medium.com/i0exception/memory-mapped-files-5e083e653b1

=================

1002 - big file and lseek

Yes, there is yet another approach, one that I recommend even less. It's basically the same as the above, but instead of reading and writing through mmap, you create one or more file descriptors for the same file and use lseek to read/write the relevant file regions.
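
One note if you do go this way: pread/pwrite take an explicit offset, so they avoid the race on the shared file position that lseek + read/write have when several threads use the same descriptor. A minimal sketch under that assumption - it still does nothing to stop a reader seeing a half-written block, so you'd need the per-asset locking discussed above for that:

```cpp
#include <cstddef>
#include <cstdint>
#include <unistd.h>
#include <vector>

// Read one asset's bytes from the package file at a known offset.
// pread/pwrite take the offset explicitly, so concurrent calls on the
// same fd do not fight over the file position the way lseek would.
std::vector<std::uint8_t> read_extent(int fd, std::uint64_t offset, std::uint64_t size) {
    std::vector<std::uint8_t> buf(size);
    ssize_t n = pread(fd, buf.data(), buf.size(), static_cast<off_t>(offset));
    if (n < 0 || static_cast<std::uint64_t>(n) != size) buf.clear();  // treat short reads as failure
    return buf;
}

bool write_extent(int fd, std::uint64_t offset, const std::vector<std::uint8_t>& data) {
    ssize_t n = pwrite(fd, data.data(), data.size(), static_cast<off_t>(offset));
    return n >= 0 && static_cast<std::size_t>(n) == data.size();
}
```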

It has all the disadvantages of the previous option and at best the same advantages.

Fabio

Former gamedev here.

Fabio gave a pretty good and detailed answer. He's absolutely right about options 1001 and 1002. I totally would NOT take that approach.

A combination of 1 and 3 would be my preferred combo. You set a cache size and, as new files are added to the cache, older ones are removed.

Depending on your game design (open world? game levels?), you can have a preprocessing step that fetches all the files you need before a level (while showing a loading screen), making sure they are available locally and downloading from the network if necessary. Re-reading your post, it appears you are already doing that?

But thousands of files will lead to a high risk of running out of FD or other resources.

You should not have the entire file system loaded at once - only those assets you are going to need for a particular level. If you need ALL files to be loaded at any one time, I would suggest going back to the drawing board and taking another look at your design and architecture.

  • Thank you. Because the game is like an MMORPG, it's hard to tell which item will be loaded. Players load items from other players, and it's almost random, so preprocessing is not very suitable. In my current architecture, I only load 50~100 files at the same time; files over the limit are unloaded over time. But it costs a lot to maintain a system with thousands of resource files. Keeping thousands of files in a folder or unloading files at runtime is high risk: if another process accidentally iterates over the folder, it can cause the game to stall, and unloading files may cause missing resources or a crash. – Yyao Jul 10 '20 at 08:45
  • Are you not using a server to centrally store all the files? A peer-to-peer system can get pretty hairy. I think you need 2 levels of cache: in memory and on disk. So let's say I just acquired a sword which I need to render in the game. I would need an id, e.g. ID_OBJECT_POINTY_SWORD, and a path location on disk. Step 1: Check if ID_OBJECT_POINTY_SWORD is loaded in memory. If it's in memory, just grab the data blob and load it. Step 2: If it's not in memory, check if it's on disk. If it's on disk, load it into memory and cache it. – Fardin Elias Jul 10 '20 at 13:22
  • Step 3: If it's not on disk, fetch it from the server, save it on disk, and load it into memory. Timestamp every file ID in the cache with when it was last used. You need to fix the size of both caches depending on your budget. Then you need a periodic process that cleans up the cache based on the timestamp of when each entry was last used. That's as simple as I can conceptualise it. I would suggest not trying to overthink or over-engineer it. – Fardin Elias Jul 10 '20 at 13:29