I need to develop an application that reads a file containing a graph of the highways of Russia. Based on the contents of this file, the application must find the shortest route between two specified towns. The application has to be written in unmanaged C++, as a console application in MS VS 2013 without MFC support. The customer's computer runs Windows 7. The A* algorithm must be used as the search engine.

My problem is the following: the file containing the highway graph is 25 GB, but the customer's computer has only 16 GB of RAM and there is no possibility to extend it. Is there any programming technology for unmanaged C++ that I can use in this case to process such a large file, i.e. when the file being read is larger than the RAM capacity of the computer? How should I design the application architecture in this case?
-
As long as it's a 64-bit program on a 64-bit OS, you'll be fine. Virtual memory will kick in automatically. Performance will be slow though. Ideally you'd break the problem into sub-problems that would all use much less memory individually. – Mark Ransom Jul 15 '14 at 05:25
-
Your 25 GB file probably contains much more information than is necessary to compute the shortest way between A and B. Maybe simplifying the information in that file while reading it is an option. – Jabberwocky Jul 15 '14 at 05:38
-
you can always devise a garbage collected chunk manager that will stream-in blocks on demand, virtualizing your file access, and after a while of not being accessed, blocks will get de-allocated... – v.oddou Jul 15 '14 at 05:40
-
Guys, my customer has a 32-bit computer with a 32-bit OS. I beg your pardon, but I forgot to mention this detail in the question. – user3769902 Jul 15 '14 at 07:04
-
Can anybody tell me: if I pass 0 as the 3rd, 4th and 5th parameters of MapViewOfFile() to get a full view of the 25 GB file, will working with that view in the application be very slow? – user3769902 Jul 15 '14 at 14:20
-
If the OS is only 32 bit then you're wasting most of that memory. It can't possibly use more than 4GB. – Mark Ransom Jul 16 '14 at 16:00
5 Answers
You have to call CreateFileMapping on the file handle and then MapViewOfFile on the mapping handle. It is very convenient and allows you to access the whole file without explicitly reading it. Your target must be 64-bit in this case...
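For illustration, a minimal sketch of this approach (the file name is made up; mapping the whole 25 GB file in a single view as shown only works in a 64-bit process):

    #include <windows.h>

    int main()
    {
        // Open the existing graph file for reading.
        HANDLE hFile = CreateFileW(L"highways.graph", GENERIC_READ, FILE_SHARE_READ,
                                   NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile == INVALID_HANDLE_VALUE) return 1;

        // Create a read-only mapping object; passing 0/0 for the maximum size
        // means "use the current size of the file".
        HANDLE hMapping = CreateFileMappingW(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
        if (!hMapping) { CloseHandle(hFile); return 1; }

        // Map the entire file into the address space (offset 0, length 0 = whole file).
        const unsigned char* data = static_cast<const unsigned char*>(
            MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0));
        if (data) {
            // ... parse the graph directly from 'data' as if it were in memory ...
            UnmapViewOfFile(data);
        }

        CloseHandle(hMapping);
        CloseHandle(hFile);
        return 0;
    }

The OS pages parts of the file in and out on demand, so you never allocate 25 GB yourself; you only pay the cost of the page faults.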

-
This will certainly work, but the performance will be at the mercy of the disk IO speed and the spatial layout of the data in the file - if it's in a small physical range that will all be paged into memory together, performance should be good, but you could be thrashing the disk and running very slowly if your graph algorithm and the data layout result in a lot of widely spaced random IO. – Brad Jul 15 '14 at 06:33
-
+1 File mapping is a feasible solution; see my question http://stackoverflow.com/questions/2171625/how-to-scan-through-really-huge-files-on-disk – Jichao Jul 15 '14 at 06:49
-
Malkocoglu wrote about the CreateFileMapping and MapViewOfFile calls. But that can only work on a 64-bit computer. What about the case of a 32-bit computer? – user3769902 Jul 15 '14 at 07:00
-
Your file is too big to be mapped in a 32-bit process. 32-bit pointers cannot address a 25 GB file buffer! – Malkocoglu Jul 15 '14 at 07:26
-
Probably, yes. For what it's worth, I am examining the file-mapping advice that Jichao mentions. – user3769902 Jul 15 '14 at 08:07
-
To Jichao: please tell me what the parameters "LPBYTE pbPattern, UINT cbPattern and LONGLONG * pcFound" of the function ScanForPattern (in stackoverflow.com/questions/2171625/) designate? – user3769902 Jul 15 '14 at 08:52
Using std::getline, you can read a file one line at a time. 640 kB of RAM would be enough ;)
I'm pretty sure it's a text file, possibly even XML. In that case you'd use a dedicated "SAX" XML parser. I know it's not binary because I know that you can fit the entire map of Europe (highways and all minor roads too) in under 8 GB.
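As a rough sketch of the line-by-line idea, assuming a hypothetical text format with one highway segment per line ("from;to;length"):

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main()
    {
        // Hypothetical format: one edge per line, "from;to;length_km".
        std::ifstream in("highways.txt");
        std::string line;
        long long edges = 0;

        while (std::getline(in, line)) {
            std::istringstream fields(line);
            std::string from, to, length;
            if (std::getline(fields, from, ';') &&
                std::getline(fields, to, ';') &&
                std::getline(fields, length, ';')) {
                // Keep only the parts the current search actually needs in memory here.
                ++edges;
            }
        }
        std::cout << "parsed " << edges << " edges\n";
        return 0;
    }

Only one line lives in memory at a time, so the size of the file on disk does not matter for this pass.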
BTW, A* is obsolete. Modern routing algorithms such as ArcFlags are much faster.

-
It's more a comment than a real answer, isn't it? If you read one line at a time, you'll have to wait a lot before your route is found ;-) But many thanks for mentioning ArcFlags (http://i11www.iti.uni-karlsruhe.de/_media/teaching/theses/files/da-baum-11.pdf), which could indeed help our Russian friend: it uses smaller precomputed partitions of the global graph, and these are more amenable to in-memory handling! – Christophe Jul 15 '14 at 18:06
-
I need to design the application architecture for the case when the file on disk is larger than the computer's RAM. This is my first problem! Let me keep "A*" for the search. – user3769902 Jul 15 '14 at 18:39
-
Can anybody give me advice on the application architecture when the graph file is very large (as in my case) and its size is larger than the RAM capacity? – user3769902 Jul 15 '14 at 18:47
-
@user3769902: As already noted, parse the file as you read it and you will not need 25 GB in memory. – MSalters Jul 16 '14 at 08:06
Hm... why not just read the file in small portions? Let's say, 128 records at a time. You can create a class that hides this mechanism inside, so the other parts of the program don't notice the difference between loading everything and this method. Another way is to map the file into memory.
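A minimal sketch of such a wrapper class, assuming fixed-size binary records (the record size, chunk size and file layout are invented for illustration; _fseeki64 is the MSVC 64-bit seek available in VS 2013):

    #include <cstdio>
    #include <vector>

    class RecordFile {
    public:
        static const size_t kRecordSize = 64;     // assumed record size in bytes
        static const size_t kChunkRecords = 128;  // records loaded per chunk

        explicit RecordFile(const char* path)
            : file_(std::fopen(path, "rb")), chunkStart_(static_cast<size_t>(-1)) {}
        ~RecordFile() { if (file_) std::fclose(file_); }

        // Returns a pointer to record 'index', loading its 128-record chunk if needed.
        const unsigned char* record(size_t index) {
            size_t start = (index / kChunkRecords) * kChunkRecords;
            if (start != chunkStart_) {           // chunk not loaded yet: read it
                buffer_.resize(kChunkRecords * kRecordSize);
                _fseeki64(file_, static_cast<__int64>(start) * kRecordSize, SEEK_SET);
                std::fread(&buffer_[0], 1, buffer_.size(), file_);
                chunkStart_ = start;
            }
            return &buffer_[(index - chunkStart_) * kRecordSize];
        }

    private:
        std::FILE* file_;
        size_t chunkStart_;
        std::vector<unsigned char> buffer_;
    };

The rest of the program just calls record(i) and never sees whether the data came from RAM or from disk.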
Btw, are u from Russia?

-
Yes, I am. I'm a native Russian. I live in Samara, a city on the left bank of the Volga river. – user3769902 Jul 15 '14 at 07:06
-
You have in mind reading the file in 128-record portions. But in that case each portion must contain information about the direct neighbours of the current node. A* needs this to determine the cost of the path from the start node to the goal node. – user3769902 Jul 15 '14 at 07:14
Your question is very broad! The main problem you'll face is that loading/unloading chunks of data can make your search extremely slow, especially if adjacent nodes are in different chunks.
As your graph represents real-world geography, your A* heuristic will certainly be the straight-line distance between points. This means that your algorithm will tend to expand paths primarily based on the distance to the target, and you could use this to optimize the loading/unloading.
Here are some hints:
if you can organise your data file by grouping the data into geographic squares, you can at least reduce the loading/unloading cost a little.
keep an in-memory index of the file offset of each square, so that you can quickly locate the next chunk to load.
you could also combine this approach with caching. Instead of loading big squares into one chunk of maximum size, load smaller squares into several smaller memory chunks that you manage with a caching algorithm (least recently used gets unloaded): as many geographic areas are not needed, they will be replaced by the most often used nodes (i.e. nodes around highways). See the sketch after this list.
with a little creativity, you could perhaps also deviate slightly from standard A*, by adding a little bit of "beaming": Instead of always taking the best path to expand, take into consideration that it could sometimes be more efficient to expand paths that remain in the same memory chunk.
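A sketch of that least-recently-used cache of geographic squares; SquareId, the cache size and loadSquareFromDisk are placeholders, and the real loader would use the in-memory offset index to seek and read one square:

    #include <list>
    #include <map>
    #include <vector>

    typedef int SquareId;
    typedef std::vector<unsigned char> SquareData;

    class SquareCache {
    public:
        explicit SquareCache(size_t maxSquares) : maxSquares_(maxSquares) {}

        const SquareData& get(SquareId id) {
            std::map<SquareId, Entry>::iterator it = cache_.find(id);
            if (it != cache_.end()) {
                // Cache hit: move this square to the front of the LRU list.
                lru_.erase(it->second.lruPos);
                lru_.push_front(id);
                it->second.lruPos = lru_.begin();
                return it->second.data;
            }
            // Cache miss: evict the least recently used square if the cache is full.
            if (cache_.size() >= maxSquares_) {
                cache_.erase(lru_.back());
                lru_.pop_back();
            }
            lru_.push_front(id);
            Entry& e = cache_[id];
            e.data = loadSquareFromDisk(id);
            e.lruPos = lru_.begin();
            return e.data;
        }

    private:
        struct Entry {
            SquareData data;
            std::list<SquareId>::iterator lruPos;
        };

        // Placeholder: the real version would seek to the square's file offset
        // (taken from the in-memory index) and read its nodes and edges.
        SquareData loadSquareFromDisk(SquareId /*id*/) { return SquareData(); }

        size_t maxSquares_;
        std::map<SquareId, Entry> cache_;
        std::list<SquareId> lru_;
    };

The A* code only ever calls get(squareOf(node)); squares around the highways between the start and goal towns stay cached, while rarely touched areas are dropped.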

-
If I specify (in the MapViewOfFile call) that the full 25 GB file should be mapped into the process address space (if that is possible), will this substantially slow the application down? – user3769902 Jul 15 '14 at 20:13
-
With MapViewOfFile(), the system makes your file accessible through virtual memory. It manages access through memory paging, swapping in and out the pages that are needed least (like a caching algorithm for chunks). But if your file is not organised to take advantage of the geographical knowledge, the mapped view will have to swap in and out much more than needed. You should have a look at this: http://stackoverflow.com/questions/1880714/createfilemapping-mapviewoffile-how-to-avoid-holding-up-the-system-memory – Christophe Jul 15 '14 at 21:10
Not sure if this is still an open issue, just saw this question.
A 32-bit application is perfectly capable of reading/writing files > 4GB.
You do not even need to create a file mapping for it, unless you absolutely must have (the equivalent of) an in-memory representation of the data. If that is the case, it would be a bit more work but still doable: you could create a class that wraps a mapped view of the file and handles "chunking" the data into sizes that fit into your available address space. As noted by others, you cannot map a file larger than the address space all at once. Also, the comments and caveats in the previous answers/comments regarding performance still apply.
A few questions: How is the file indexed (if it is)? Can you post an example of the index layout and a record if it has an index? Even if not indexed, you could create your own index and access records through your wrapper class. Can you post an example of an individual record?
If you do not need an in-memory representation, a better option - certainly from a performance standpoint - would be to just create a class that handles reads from/writes to the file, using SetFilePointerEx to seek before any reads. If the file has an index that fits into the available address space, you could even map just that portion of the file into memory and do seeks/reads as necessary for the individual records (which are not mapped). A more complex approach could also cache records on an MRU (or some other) basis to help with performance.
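A sketch of such a read wrapper, using SetFilePointerEx with a 64-bit offset so that even a 32-bit process can reach any position in the 25 GB file (the class and file names are invented):

    #include <windows.h>

    class GraphFileReader {
    public:
        explicit GraphFileReader(const wchar_t* path)
            : file_(CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                                OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL)) {}
        ~GraphFileReader() { if (file_ != INVALID_HANDLE_VALUE) CloseHandle(file_); }

        // Reads 'size' bytes starting at the 64-bit file offset 'offset'.
        bool readAt(LONGLONG offset, void* buffer, DWORD size) {
            LARGE_INTEGER pos;
            pos.QuadPart = offset;
            if (!SetFilePointerEx(file_, pos, NULL, FILE_BEGIN))
                return false;
            DWORD read = 0;
            return ReadFile(file_, buffer, size, &read, NULL) && read == size;
        }

    private:
        HANDLE file_;
    };

Usage would be something like reader.readAt(recordIndex * recordSize, &record, recordSize), where the offsets come from the (mapped or in-memory) index.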
