Doing file operations with 64-bit addresses in C + MinGW32

Question

I'm trying to read in a 24 GB XML file in C, but it won't work. I'm printing out the current position using ftell() as I read it in, but once it gets to a big enough number, it goes back to a small number and starts over, never even getting 20% through the file. I assume this is a problem with the range of the variable that's used to store the position (long), which can go up to about 4,000,000,000 according to http://msdn.microsoft.com/en-us/library/s3f49ktz(VS.80).aspx, while my file is 25,000,000,000 bytes in size. A long long should work, but how would I change what my compiler(Cygwin/mingw32) uses or get it to have fopen64?

I dont have internet lots of the time, and I think wikipedia would be useful, so I downloaded it, and now im trying to compress it and make a reader for it — zacaj, Oct 14 '09 at 20:50
Do you actually fail to read the file in, or just fail to find the size of the file? — Adrian Mouat, Oct 14 '09 at 20:50
Fail to read it. Im going through the file, finding all the titles in the first stage, and storing them along with their position, and then ill go through again, and compress the articles in chunks, indexing them in a main file, but it never gets through the file, since the position wraps around, and I start reading the beginning again — zacaj, Oct 14 '09 at 20:53
Off topic, but I'm a little surprised all of wikipedia is only 24gigs (including XML overhead) — Dana, Oct 14 '09 at 20:54
Me too, but it actually seems pretty close... 24 billion letters...(its only english)... It sure looks like it has everything(just from browsing the titles my program prints out. Plus, take a look at encyclopedia for ipod(where I got the idea), they fit the entire thing in 1-2GB, and display it on an iPod — zacaj, Oct 14 '09 at 20:59
If you successfully read that file, you could say "I'd read entire Wikipedia" — Rubens Farias, Oct 14 '09 at 21:00
Assuming that's just text, I'm surprised it's that **large**. A standard (hard-copy) encyclopedia has about, say, 4KB of text per page (~90 columns, 50 rows). Say, 700 pages per alphabetical*26 volumes, 73MB/World Book. Round up to 100MB. So, Wikipedia is 240 World Books. — Michael Petrotta, Oct 14 '09 at 21:03
Have you seen this: http://arstechnica.com/gadgets/news/2009/10/openmoko-offline-reader-puts-wikipedia-in-your-pocket.ars — mocj, Oct 14 '09 at 21:04
I know your programming question is good and valid, but aren't there some existing offline wikipedia tools? I saw this but don't know if it is of interest to you (maybe you just want the fun of doing it yourself): http://blog.fupps.com/2008/05/20/wikitaxi-use-a-local-copy-of-wikipedia/ — Donald Byrd, Oct 14 '09 at 21:04
@mocj: I was just going to suggest that. I have one of the OpenMoko freerunner phones. Cool device, though clearly a beta. IIRC, the new wiki reader thingy has 3 million articles and it takes 4GB. — rmeador, Oct 14 '09 at 21:20
@Michael Petrotta Wikipedia is *huge*. http://en.wikipedia.org/wiki/File:Size_of_English_Wikipedia_in_August_2007.svg — Tim Sylvester, Oct 14 '09 at 21:24
@Tim: that looks like about 63 World Books. It might have gotten within spitting distance of my estimate in the two years since. — Michael Petrotta, Oct 14 '09 at 23:50
Wouldn't it be easier to use wget or something similar to just download each page, etc., as an individual file? File systems are very good at dealing with things like this :) — Justin R., Oct 18 '09 at 20:22
I'm not a C programmer, but couldn't you use a memory mapped file? — RCIX, Apr 17 '10 at 01:17

score 3 · Answer 1 · answered Oct 14 '09 at 20:42

3

The ftell() function typically returns an unsigned long, which only goes up to 2³² bytes (4 GB) on 32-bit systems. So you can't get the file offset for a 24 GB file to fit into a 32-bit long.

You may have the ftell64() function available, or the standard fgetpos() function may return a larger offset to you.

answered Oct 14 '09 at 20:42

David R Tribble

11,918
5
42
52

I dont have ftell64(), and fgetpos() returns the same thing as ftell() – zacaj Oct 14 '09 at 20:44

Dolphin · Answer 2 · 2009-10-14T21:04:54.787

3

You might try using the OS provided file functions CreateFile and ReadFile. According to the File Pointers topic, the position is stored as a 64bit value.

edited Oct 14 '09 at 21:04

answered Oct 14 '09 at 20:57

Dolphin

4,655
1
30
25

2

Don't scare people, those are C functions and part of the windows API :) – Dolphin Oct 14 '09 at 21:10

Adrian Mouat · Answer 3 · 2009-10-14T21:33:30.500

0

Unless you can use a 64-bit method as suggested by Loadmaster, I think you will have to break the file up.

This resource seems to suggest it is possible using _telli64(). I can't test this though, as I don't use mingw.

edited Oct 14 '09 at 21:33

answered Oct 14 '09 at 21:01

Adrian Mouat

44,585
16
110
102

But theres no compiler option or anything to enable them? I can see them in the include files, but theyre under an #ifdef. – zacaj Oct 14 '09 at 21:18

score 0 · Answer 4 · answered Oct 14 '09 at 21:39

I don't know of any way to do this in one file, a bit of a hack but if splitting the file up properly isn't a real option, you could write a few functions that temp split the file, one that uses ftell() to move through the file and swaps ftell() to a new file when its reaching the split point, then another that stitches the files back together before exiting. An absolutely botched up approach, but if no better solution comes to light it could be a way to get the job done.

score 0 · Answer 5 · edited May 23 '17 at 12:13

I found the answer. Instead of using fopen, fseek, fread, fwrite... I'm using _open, lseeki64, read, write. And I am able to write and seek in > 4GB files.

Edit: It seems the latter functions are about 6x slower than the former ones. I'll give the bounty anyone who can explain that.

Edit: Oh, I learned here that read() and friends are unbuffered. What is the difference between read() and fread()?

score -1 · Answer 6 · answered Feb 15 '10 at 20:07

Even if the ftell() in the Microsoft C library returns a 32-bit value and thus obviously will return bogus values once you reach 2 GB, just reading the file should still work fine. Or do you need to seek around in the file, too? For that you need _ftelli64() and _fseeki64().

Note that unlike some Unix systems, you don't need any special flag when opening the file to indicate that it is in some "64-bit mode". The underlying Win32 API handles large files just fine.

Doing file operations with 64-bit addresses in C + MinGW32

6 Answers6