
I'm trying to work with nasty, large XML and text documents: ~40GB. I'm using Visual Studio 2012 on Windows 7.

I'm going to use Xerces to snag the header/'footer tag' from the XMLs.

I want to map an area of the file, say 60-120MB.

Then split the map into (3 * number of processor cores) equal parts, set each part up as a buffer, and load the buffers into an array.

Then, using (number of processor cores) while statements in new threads, I will simultaneously count characters/lines/XML cycles while chewing through the buffer array. When one buffer is completed, the process will jump to the next 'available' buffer and the completed buffer will be dropped from memory. At the end I will add the total results to a project log.

Afterwards, I will reference the log, split the file by character count/size (or another option) to the nearest line or cycle, and drop the header and 'footer tag' into all of the splits.

I'm doing this so I can import massive data to a MySQL server over a network with multiple computers.

My question is: how do I create the buffer array and the file map with new threads?

Can I use:

Win32 CreateFile

Win32 CreateFileMapping

Win32 MapViewOfFile

with standard ifstream operations and char buffers, or should I opt for something else?

Further clarification: my thinking is that if I can have the hard drive streaming the file into memory from one place and in one direction, then I can use the full processing power of the machine to chew through separate but equal buffers.

~Flavor: It's kind of like being a shepherd trying to scoop food out of one huge bin with 3-6 large buckets, with only two arms, for X sheep that need to stay inside the fenced area. But they all move at the speed of light.

A few ideas or pointers might help me along here. Any thoughts are most welcome. Thanks.

while (getline(my_file, myStr))
{
   characterCount += myStr.length();
   lineCount++;

   if (my_file.eof()) {
      break;
   }
}

This was the only code running during the test: 2 hours 30+ minutes, at 45-50% total processor usage for the program, on a dual-core 1.6GHz laptop with 2GB RAM. Most of the RAM in use right now is 600+MB from ~50 tabs open in Firefox, Visual Studio at 60MB, and so on.

IMPORTANT: During the test, the program running the code, which is only a window and a dialog box, seemed to dump its own working and private set of RAM down to around 300KB, and didn't respond for the length of the test. I need to make another thread for the while statement, I'm sure. But this means that NONE of the file was read into a buffer. The CPU was struggling for the entire run to keep up with the tiniest effort from the hard drive.

P.S. Further proof of CPU bottlenecking: it might take me 20 minutes to transfer that entire file to another computer over my wireless network, which includes the read process and a socket catch-and-write process on the other computer.

UPDATE

I used this adorable little thing to go from the previous test time to about 15-20 minutes, which is in line with what Mats Petersson was saying.

while (my_file.read(&bufferOne[0], bufferOne.size()))
{
    int cc = my_file.gcount();

    for (int i = 0; i < cc; i++)
    {
        if (bufferOne[i] == '\n')
            lineCount++;

        characterCount++;
    }

    currentPercent = characterCount / onePercent;
    SendMessage(GetDlgItem(hDlg, IDC_GENPROGRESS), PBM_SETPOS, currentPercent, 0);
}

Granted, this is a single loop, and it behaved much more appropriately than the previous test. This test was ~800% faster than the tight getline loop shown above. I set the buffer for this loop at 20MB. I jacked this code from: SOF - Fastest Example

BUT...

I would like to point out that while polling the process in Resource Monitor and Task Manager, it clearly showed the first core at 75-90% usage, the second fluctuating between 25-50% (pretty standard for the minor background stuff I have open), and the hard drive at... wait for it... 50%. There were some 100% disk-time spikes but also some lows at 25%. All of which basically means that splitting the buffer processing between two different threads could very well be a benefit. It will use all the system resources, but that's what I want. I'll update later today when I have the working prototype.

MAJOR UPDATE: Finally finished my project after a bunch of learning. No file map needed, only a bunch of vector<char>s. I have successfully built a dynamically executing file-stream line and character counter. The good news: it went from the previous 10-15 minute mark to ~3-4 minutes on a 5.8GB file. BOOYA!~
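For anyone curious, here is a rough sketch of the kind of pipeline described above. The names, buffer size, and thread count are illustrative only, not my actual project code: one thread streams the file into vector<char> buffers while worker threads count characters and newlines.

// Sketch only: a reader thread streams the file into 20MB vector<char> buffers,
// worker threads pull buffers off a small queue and count characters/newlines.
#include <atomic>
#include <condition_variable>
#include <fstream>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main()
{
    const char*  path       = "huge.xml";            // placeholder file name
    const size_t bufferSize = 20 * 1024 * 1024;      // 20MB, same as the test above
    const int    numWorkers = 2;                     // one per core on the laptop

    std::queue<std::vector<char> > buffers;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::atomic<long long> characterCount(0), lineCount(0);

    // Workers: pop a filled buffer, count it, then let it go out of scope.
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w)
        workers.emplace_back([&] {
            for (;;) {
                std::vector<char> buf;
                {
                    std::unique_lock<std::mutex> lock(m);
                    cv.wait(lock, [&] { return done || !buffers.empty(); });
                    if (buffers.empty()) return;     // finished and drained
                    buf = std::move(buffers.front());
                    buffers.pop();
                }
                cv.notify_all();                     // wake the reader if it hit the cap
                long long lines = 0;
                for (size_t i = 0; i < buf.size(); ++i)
                    if (buf[i] == '\n') ++lines;
                lineCount += lines;
                characterCount += (long long)buf.size();
            }                                        // buf goes out of scope: memory dropped
        });

    // Reader: stream the file sequentially, one buffer at a time.
    std::ifstream file(path, std::ios::binary);
    while (file) {
        std::vector<char> buf(bufferSize);
        file.read(&buf[0], buf.size());
        buf.resize((size_t)file.gcount());
        if (buf.empty()) break;
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return buffers.size() < 4; });  // cap RAM use
            buffers.push(std::move(buf));
        }
        cv.notify_all();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_all();
    for (size_t i = 0; i < workers.size(); ++i) workers[i].join();

    std::cout << "characters: " << characterCount << "  lines: " << lineCount << std::endl;
    return 0;
}

The capped queue keeps the reader from racing too far ahead of the counters, which matters on a 2GB machine.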

LightKeep
  • Isn't this linear parsing what SAX was designed for? I don't think it will allow for multiple machines doing it though. – Deanna Mar 12 '13 at 16:14
  • I'm using multiple machines for the import process afterwards. The parsing and dividing of the file will only be done on one system. – LightKeep Mar 12 '13 at 16:25
  • Ok, so the first thing that is fairly obvious is that a 5.6GB file will probably contain about 100M newlines, so that's 100M divide operations (some 20-30 clock cycles on its own) and 100M calls to SendMessage() and GetDlgItem(). Both of these are potentially heavier than the entire reading of the string. Add something like `if ((lineCount & 1023) == 0) { currentPercent = ....; SendMessage(...); }` – Mats Petersson Mar 12 '13 at 23:26
  • Not sure if you saw my comment below, but the SendMessage and current percent bit was added after the build/run test. It was only the while statement with the char count, line count, and break statement. – LightKeep Mar 12 '13 at 23:31
  • You're right about it though. I don't need it sending 100 million SendMessage requests; sure makes a smooth transition though ;) – LightKeep Mar 12 '13 at 23:34
  • Are you running a debug or release build of the code? I suspect debug - there is NO way that `getline()` should take that long. – Mats Petersson Mar 12 '13 at 23:42
  • It's Debug, you're right. Sorry for the delay, I didn't see your comment. – LightKeep Mar 13 '13 at 00:57

1 Answer


Very short answer: Yes, you can use those functions.

For reading data, it's likely the most efficient method to map the file content into memory, since it saves having to copy the data into a buffer in the application; it is read straight into the place it's supposed to go. So, no problem as long as you have enough address space available - a 64-bit machine should certainly have plenty, while in a 32-bit process it may be more of a scarce resource - but for sections of a few hundred MB, it shouldn't be a huge issue.
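As a rough sketch of those three calls (read-only mapping, error handling trimmed; the file name, view size, and zero offset are placeholders, and real code must round offsets to the allocation granularity and clamp the view to the end of the file):

#include <windows.h>
#include <iostream>

int main()
{
    // Open the big file for sequential, read-only access.
    HANDLE file = CreateFileA("huge.xml", GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING,
                              FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // A mapping object covering the whole file (size 0/0 = "use the file size").
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!mapping) { CloseHandle(file); return 1; }

    // Map a 120MB window; for later windows, bump the 64-bit offset and
    // split it into the high/low DWORDs as below.
    ULONGLONG offset = 0;
    SIZE_T viewSize = (SIZE_T)120 * 1024 * 1024;   // assumes the file is at least this large
    const char* view = (const char*)MapViewOfFile(mapping, FILE_MAP_READ,
                                                  (DWORD)(offset >> 32),
                                                  (DWORD)(offset & 0xFFFFFFFFu),
                                                  viewSize);
    if (!view) { CloseHandle(mapping); CloseHandle(file); return 1; }

    // The view is ordinary memory: count in place, or carve
    // [view, view + viewSize) into equal slices for several threads.
    long long lines = 0;
    for (SIZE_T i = 0; i < viewSize; ++i)
        if (view[i] == '\n') ++lines;
    std::cout << "newlines in this window: " << lines << std::endl;

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}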

However, as for using multiple threads, I'm not at all convinced. I have a fair idea that reading more than one part of a very large file at once will be counterproductive: it will increase the amount of head movement on the disk, which is a large part of what limits the transfer rate. You can count on some 50-100MB/s transfer rates for "ordinary" systems. If the system has some sort of RAID controller or some such, maybe around double that - very exotic RAID controllers may achieve three times.

So reading 40GB will take somewhere on the order of 3-15 minutes (40GB at 50MB/s is roughly 800 seconds, about 13 minutes; at 200-300MB/s it drops to a few minutes).

The CPU is probably not going to be very busy, and running multiple threads is quite likely to worsen the overall performance of the system.

You may want to keep a thread for reading and one for writing, and only actually write out the data once you have a sufficient amount of it, again, to avoid unnecessary moves of the read/write head on the disk(s).
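A stripped-down illustration of that split, using a plain file-to-file copy as a stand-in for the real output stage (the file names, the 64MB batch size, and the use of std::thread are assumptions, not code from the question): the main thread reads the next chunk while a second thread flushes the previous one, so every write goes out as one large sequential block.

#include <fstream>
#include <thread>
#include <vector>

int main()
{
    std::ifstream in("huge.xml", std::ios::binary);    // placeholder names
    std::ofstream out("huge.part", std::ios::binary);
    const size_t chunk = 64 * 1024 * 1024;             // only write once 64MB has accumulated

    std::vector<char> reading(chunk), writing;
    std::thread writer;                                // flushes the previously read chunk

    while (in) {
        in.read(&reading[0], reading.size());
        reading.resize((size_t)in.gcount());           // keep only what was actually read
        if (writer.joinable()) writer.join();          // wait until the last batch is on disk
        writing.swap(reading);                         // hand the fresh data to the writer
        reading.resize(chunk);                         // recycle the other buffer for the next read
        if (!writing.empty())
            writer = std::thread([&] { out.write(&writing[0], writing.size()); });
    }
    if (writer.joinable()) writer.join();
    return 0;
}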

Mats Petersson
  • This is my thinking more or less, except that I wanted to make the processors work more. When I'm doing these sorts of operations the hard drive seems to bottleneck the fastest. I won't write the data to a log until the operation is complete, but I was thinking multiple threads so I could max out processor usage. Also I wanted separate threads because I'm polling usage statistics every 30 seconds. I wanted to make the number of threads dynamic so the system was brought as close to 100% in any area as possible. – LightKeep Mar 12 '13 at 16:24
  • By all means try this - create a large file, read from different sections in different threads [make it configurable how many threads it has], and take the time. But I expect the throughput to be lower. – Mats Petersson Mar 12 '13 at 16:26
  • Hmm, that's why I was going to try and take a 120MB chunk and split the chunk into pieces to read from in memory, so that, more or less, the processors would keep pace with one another and it will always be reading from the same 120MB chunk in memory. 120MB/6 = 6 * 20MB buffers in memory, all from the same area of the file from the disk's perspective. Then when one 20MB is finished the disk just continues to read from the same place it left off for another 20MB. – LightKeep Mar 12 '13 at 16:30
  • I might be misunderstanding your knowledge about what the hard drive is capable of. In which case, I'll have to rethink and simplify to one 'while' statement. But if you have any other ideas or comments, I'd love to hear them. – LightKeep Mar 12 '13 at 16:35
  • Yes, but the time it takes a modern processor to process 20MB is FAR shorter than the time it takes for the hard disk to deliver the next 20MB. However, reading SMALL sections is actually faster than reading large sections - but with a memory mapped file, that's not a big issue, as the OS will read in bits of the file as and when it's needed, in blocks of (typically) 4KB. – Mats Petersson Mar 12 '13 at 16:45
  • Maybe it's the multithreading that's a problem, but here's what I'm seeing. 2 processors = 2 kids sitting at a table, RAM = the table, hard disk = box filled with Rubik's Cubes next to the table. Read process from the hard disk to memory = a guy putting Rubik's Cubes on the table. We start with 6 Rubik's Cubes on the table; when one is solved the kid picks up the next cube that is already on the table, and while he is solving the new one the guy is grabbing another 'extra' cube and setting it on the table. – LightKeep Mar 12 '13 at 16:47
  • So, if we fetch the Rubik's cube from the shop about 20 minutes away, and the children sitting at the table can solve the cube in about 1 second, your analogy works really well. – Mats Petersson Mar 12 '13 at 16:51
  • Ah, so the processors are just too fast, in which case the kids will all be fighting over the next Rubik's Cube because they will have solved the first 6 faster than the guy can put them on the table. So the fighting will take more computation and slow the whole process down. But using only 1 processor will solve the problem? – LightKeep Mar 12 '13 at 16:52
  • Lol, I'd pay money to see a kid solve a Rubik's Cube in 1 second. – LightKeep Mar 12 '13 at 16:53
  • So the hard drive is usually the bottleneck, you're saying. I guess I'm just surprised that modern HDs haven't caught pace a little more with CPUs. Bummer. It wassssss a good idea... – LightKeep Mar 12 '13 at 16:56
  • Hard disks are indeed quite slow. If I write a piece of code that, say, fills 10GB of RAM, it will take a couple of seconds at most. Writing 10GB to disk will take a couple of minutes. And as soon as you start asking for the read to happen from a different place, you get MORE problems [your guy fetching the rubik's cubes would have to look at a map and find another shop to get them from]. – Mats Petersson Mar 12 '13 at 16:59
  • By all means, DO measure it. It may be that you can speed it up. But I doubt it. – Mats Petersson Mar 12 '13 at 17:00
  • Yes, because of the disk transfer rate's limitations you should split this large file into 64MB blocks, not among cores but among separate machines over the network (placed close to each other, say in the same rack, to decrease network load). At least that is what Google does in its data centers consisting of many conventional servers. – SChepurin Mar 12 '13 at 17:06
  • That's exactly what I was about to post, but it wouldn't be worth it for me because of the file copy/transfer times. BUMMER. New expression though: Being Clever in the wrong way, is not Clever at all. – LightKeep Mar 12 '13 at 17:12
  • "Being Clever in the wrong way, is not Clever at all." - that is exactly what you are about to implement. You can not do it effectively on one machine with many cores and huge memory. – SChepurin Mar 12 '13 at 17:19
  • This was a great discussion, I'd upvote you if I could, thanks for it. – LightKeep Mar 12 '13 at 17:34
  • Well guys, I hate to say it, but I was right to begin with. I just ran a test on a 5GB text file, counting characters and lines with a single 'while' statement running. I haven't even implemented the progress bar SendMessages and the 1st core of the processor is maxed out, reading at 1MB per second, with the 2nd core more or less silent. 1MB/sec is at worst 1/30th of the potential HD read throughput. On top of all this, it's still running in the background, after nearly 45 min on a 1.6GHz dual-core laptop. Soooo... back to my question.. How do I map 120MB and divide it into equal parts??? – LightKeep Mar 12 '13 at 20:40
  • Can you post the code you have so far? I don't have a Windows dev environment set up, so can't really test/produce Windows specific code, but if you post a bit of what you've done so far (in the question itself), I'm sure we can help. Most important is how you are actually reading the file. I have posted about "how to read files" in three or four different questions previously, so I have a fair idea of some ways to "get closer to the actual disk transfer rate". – Mats Petersson Mar 12 '13 at 22:15
  • You didn't get the part of "in your original question", did you? Reading code in comments is near on impossible... ;) – Mats Petersson Mar 12 '13 at 23:18
  • Sorry, posting code in these comment boxes isn't as nice as posting it elsewhere. It took 2:40ish min to complete that code on a 5.7GB file. One core was between 90-98% the entire time and the hard drive was spiking periodically, but only 600-900KB a second transfer, so around 0-3% drive time. Via Task Manager and Resource Monitor. – LightKeep Mar 12 '13 at 23:19
  • The progress bar 'SendMessage's were added after the fact, so they aren't included in the time for the test run. – LightKeep Mar 12 '13 at 23:25
  • @LightKeep: Did you account for the fact that a 5GB file might be in your RAM cache, especially if you've been copying it or accessing it recently? The 3% drive time tells me this must be the case. – Zan Lynx Mar 12 '13 at 23:27
  • The file hasn't been loaded except briefly by a 30MB RAM cache PHP script since last restart. I have only 2GB Memory. – LightKeep Mar 12 '13 at 23:29