2

I have a file with space-separated numbers. Its size is about 1 GB and I want to get the numbers from it. I've decided to use memory-mapped files to read it fast, but I don't understand how to do it. I tried the following:

var mmf = MemoryMappedFile.CreateFromFile("test", FileMode.Open, "myFile");
var mmfa = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);
var nums = new int[6];
var a = mmfa.ReadArray<int>(0, nums, 0, 6); 

But if "test" contains just "01", I get 12337 in nums[0] (12337 = 48*256+49). I've searched the internet but didn't find anything about my question, only material about byte arrays or interprocess communication. Can you show me how to get 1 in nums[0]?

Jeff LaFay
Vasilii Ruzov
  • 1
    If your data is ASCII, you need to parse it before you can convert it to an int. Another option would be to write a converter that reads your file line by line and writes the integers as binary values into a new file. Then you can use your approach above to read the integers. – Alois Kraus May 05 '12 at 19:28
  • @Alois: nice catch, the OP actually wants to convert from ASCII into binary representation. – Vlad May 05 '12 at 19:31
  • If speed is your main concern it might help to look at: http://stackoverflow.com/questions/7153315/how-to-parse-a-text-file-in-c-sharp-and-be-io-bound – Alois Kraus May 05 '12 at 20:05

3 Answers

4

The following example reads ASCII integers from a memory-mapped file without creating any strings, which is the fastest approach I found. The solution provided by MiMo is much slower: it runs at about 5 MB/s, which will not help you much. Its biggest issue is that it calls a method (Read) for every byte, which costs a whopping factor of 15 in performance. I wonder why you accepted his solution when your original problem was performance. You can get 20 MB/s with a dumb string reader that parses each string into an integer. Fetching every byte via a method call ruins your read performance.

The code below maps the file in 200 MB chunks to avoid exhausting the 32-bit address space. It then scans through the buffer with a byte pointer, which is very fast. The integer parsing is easy if you do not take localization into account. What is interesting is that when I create a view of the mapping, the only way to get a pointer to the view buffer returns a pointer that does not start at the offset I asked for.

I would consider this a bug in the .NET Framework, and it is still not fixed in .NET 4.5. The SafeMemoryMappedViewHandle buffer is allocated with the allocation granularity of the OS. If you create a view at some offset, the pointer you get back still points to the start of that buffer, so you have to add the difference yourself. This is really unfortunate, because working with the pointer makes the difference between 5 MB/s and 77 MB/s in parsing performance.

Did read 258.888.890 bytes with 77 MB/s


using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

unsafe class Program
{
    static void Main(string[] args)
    {
        new Program().Start();
    }

    private void Start()
    {
        var sw = Stopwatch.StartNew();
        string fileName = @"C:\Source\BigFile.txt";//@"C:\Source\Numbers.txt";
        var file = MemoryMappedFile.CreateFromFile(fileName);
        var fileSize = new FileInfo(fileName).Length;
        int viewSize = 200 * 1000 * 1000; // 200 MB per view
        long offset = 0;
        for (; offset < fileSize-viewSize; offset +=viewSize ) // create 200 MB views
        {
            using (var accessor = file.CreateViewAccessor(offset, viewSize))
            {
                int unReadBytes = ReadData(accessor, offset);
                offset -= unReadBytes;
            }
        }

        using (var rest = file.CreateViewAccessor(offset, fileSize - offset))
        {
            ReadData(rest, offset);
        }
        sw.Stop();
        Console.WriteLine("Did read {0:N0} bytes with {1:F0} MB/s", fileSize, (fileSize / (1024 * 1024)) / sw.Elapsed.TotalSeconds);
    }


    List<int> Data = new List<int>();

    private int ReadData(MemoryMappedViewAccessor accessor, long offset)
    {
        using(var safeViewHandle = accessor.SafeMemoryMappedViewHandle)
        {
            byte* pStart = null;
            safeViewHandle.AcquirePointer(ref pStart);
            ulong correction = 0;
            // needed to correct the offset because the view handle does not start at the offset specified in the CreateViewAccessor call
            // This makes AcquirePointer nearly useless.
            // http://connect.microsoft.com/VisualStudio/feedback/details/537635/no-way-to-determine-internal-offset-used-by-memorymappedviewaccessor-makes-safememorymappedviewhandle-property-unusable
            pStart = Helper.Pointer(pStart, offset, out correction);
            var len = safeViewHandle.ByteLength - correction;
            bool digitFound = false;
            int curInt = 0;
            byte current =0;
            for (ulong i = 0; i < len; i++)
            {
                current = *(pStart + i);
                if (current >= (byte)'0' && current <= (byte)'9')
                {
                    // accumulate the next decimal digit
                    curInt = curInt * 10 + (current - '0');
                    digitFound = true;
                }
                else if (digitFound)
                {
                    // any non-digit byte (space, newline, ...) ends the current number
                    Data.Add(curInt);
                  //  Console.WriteLine("Add {0}", curInt);
                    digitFound = false;
                    curInt = 0;
                }
            }

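            // scan backwards to find a partially read number at the end of the view
            // (also needed when that partial number is 0, so test digitFound only)
            // Note: a number that ends the file without a trailing space is not added.
            int unread = 0;
            if (digitFound)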
            {
                byte* pEnd = pStart + len;
                while (true)
                {
                    pEnd--;
                    if (*pEnd == (byte)' ' || pEnd == pStart)
                    {
                        break;
                    }
                    unread++;

                }
            }

            safeViewHandle.ReleasePointer();
            return unread;
        }
    }

    public unsafe static class Helper
    {
        static SYSTEM_INFO info;

        static Helper()
        {
            GetSystemInfo(ref info);
        }

        public static byte* Pointer(byte *pByte, long offset, out ulong diff)
        {
            var num = offset % info.dwAllocationGranularity;
            diff = (ulong)num; // return difference

            byte* tmp_ptr = pByte;

            tmp_ptr += num;

            return tmp_ptr;
        }

        [DllImport("kernel32.dll", SetLastError = true)]
        internal static extern void GetSystemInfo(ref SYSTEM_INFO lpSystemInfo);

        internal struct SYSTEM_INFO
        {
            internal int dwOemId;
            internal int dwPageSize;
            internal IntPtr lpMinimumApplicationAddress;
            internal IntPtr lpMaximumApplicationAddress;
            internal IntPtr dwActiveProcessorMask;
            internal int dwNumberOfProcessors;
            internal int dwProcessorType;
            internal int dwAllocationGranularity;
            internal short wProcessorLevel;
            internal short wProcessorRevision;
        }
    }

    void GenerateNumbers()
    {
        using (var file = File.CreateText(@"C:\Source\BigFile.txt"))
        {
            for (int i = 0; i < 30 * 1000 * 1000; i++)
            {
                file.Write(i.ToString() + " ");
            }
        }
    }

}
Alois Kraus
  • I don't actually think the memory-mapped files will make any speed-up in comparison to regular files. – Vlad May 05 '12 at 20:03
  • If speed is the main concern a binary format would be the way to go. – Alois Kraus May 05 '12 at 20:08
  • @Vlad: I have updated my answer to get the maximum speed out of it. It beats MiMo's solution by a factor of 11. – Alois Kraus May 07 '12 at 11:56
  • great code. works really fast! actually it matches my requirements – Vasilii Ruzov May 08 '12 at 17:37
  • Thanks. It won't get much faster. You can preallocate the list with a fixed size if you know roughly how much data is about to be read, to minimize list reallocations. But then you have really reached the limits of what you can do with .NET. ... But wait, you can get faster if you split the file into two views, use two threads to read the data, and later concatenate the results of the second view. That could give you another factor of 2. – Alois Kraus May 08 '12 at 19:36
1

You need to parse the file content, converting the characters into numbers - something like this:

List<int> nums = new List<int>();
long curPos = 0;
int curV = 0;
bool hasCurV = false;
while (curPos < mmfa.Capacity) {
  byte c;
  mmfa.Read(curPos++, out c);
  if (c == 0) {
    break;
  }
  if (c == 32) {
    if (hasCurV) {
      nums.Add(curV);
      curV = 0;
    }
    hasCurV = false;
  } else {
    curV = checked(curV*10 + (int)(c-48));
    hasCurV = true;
  }
}
if (hasCurV) {
  nums.Add(curV);
}

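This assumes that the file contains only digits separated by spaces (i.e. no line breaks or other whitespace). Note that mmfa.Capacity can be larger than the file, because the size of a view is rounded up to the next system page size; the extra bytes read as 0, which is why the loop stops at the first zero byte.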
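The same parsing logic can also run over a byte buffer instead of calling Read once per byte. The following is only a sketch under assumptions: the class name AsciiIntParser and its Parse method are made up for illustration, and the buffer is assumed to have been filled beforehand, for example with mmfa.ReadArray<byte>(position, buffer, 0, buffer.Length). Unlike the loop above, it treats every non-digit byte as a separator.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
static class AsciiIntParser
{
    // Parses non-negative decimal integers from the first `count` bytes of
    // `buffer`. Any byte that is not an ASCII digit acts as a separator;
    // a zero byte is taken as the end of the data.
    public static List<int> Parse(byte[] buffer, int count)
    {
        var nums = new List<int>();
        int cur = 0;
        bool hasCur = false;
        for (int i = 0; i < count; i++)
        {
            byte c = buffer[i];
            if (c == 0)
            {
                break; // padding past the end of the file
            }
            if (c >= (byte)'0' && c <= (byte)'9')
            {
                // checked: throw instead of silently overflowing
                cur = checked(cur * 10 + (c - '0'));
                hasCur = true;
            }
            else if (hasCur)
            {
                nums.Add(cur);
                cur = 0;
                hasCur = false;
            }
        }
        if (hasCur)
        {
            nums.Add(cur); // last number without a trailing separator
        }
        return nums;
    }
}
```

If the file is read in several buffers, a number can be split across two of them; this sketch does not handle that case, so the caller would have to carry the unfinished number over to the next buffer.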

MiMo
  • you need to process newlines as well :) and watch for overflows – Vlad May 05 '12 at 19:47
  • @Vlad: thanks - fixed (except for the new line - 'assuming....no end lines or other white spaces') – MiMo May 05 '12 at 19:52
  • overflow exception in `curV = checked(curV*10 + (int)(c-48));` – Vasilii Ruzov May 05 '12 at 20:04
  • @VasiliiRuzov: if you have numbers bigger than `2^31-1` in your file the overflow is the expected behavior (caused by `checked`) - if not there is a bug in my code... – MiMo May 05 '12 at 20:10
  • no. i have numbers not bigger than 2000. i added `if (c == 0) break;` and now it works – Vasilii Ruzov May 05 '12 at 20:11
  • I see - getting 0 at the end of the file..I am fixing the code – MiMo May 05 '12 at 20:17
  • -1 for using Read which brings you down to 5 MB/s parsing speed. My enhanced solution can get 77MB/s. Do not call a method for every byte of the file. – Alois Kraus May 05 '12 at 23:03
0

48 = 0x30 = '0', 49 = 0x31 = '1'

So you really are getting your characters; they are just ASCII-encoded.

The string "01" takes 2 bytes, which fit into one int, so you get both of them packed into a single int. If you want to get them separately, you need to ask for an array of bytes.
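To make the arithmetic concrete, here is a minimal sketch. The class and method names are made up for illustration, and it assumes little-endian hardware such as x86, where the first byte in the file becomes the low-order byte of the int. Which of the two bytes ends up as the high-order byte decides the value you see: the 12337 from the question corresponds to 49 as the low byte and 48 as the high byte.

```csharp
using System;

// Hypothetical helper, for illustration only.
static class AsciiBytesDemo
{
    // Combines two bytes the way a little-endian CPU does when it reads
    // them as one integer: `first` is the low-order byte.
    public static int AsLittleEndianInt(byte first, byte second)
    {
        return first | (second << 8);
    }

    // Converts a single ASCII digit byte ('0'..'9') to its numeric value.
    public static int DigitValue(byte asciiDigit)
    {
        return asciiDigit - (byte)'0';
    }
}
```

For example, AsciiBytesDemo.AsLittleEndianInt(48, 49) is 48 + 49*256 = 12592 and AsciiBytesDemo.AsLittleEndianInt(49, 48) is 49 + 48*256 = 12337, while AsciiBytesDemo.DigitValue(49) is 1, the value the question is after.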


Edit: if "01" needs to be parsed into the number 1, i.e. converted from its ASCII representation into binary, you need to take a different approach. I would suggest:

  1. do not use a memory-mapped file,
  2. read the file with StreamReader line by line (see example here),
  3. split each line into chunks using string.Split,
  4. parse each chunk into a number using int.Parse.
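The steps above could look roughly like the following sketch. The class name LineNumberReader and the method ReadInts are made up for illustration; the method accepts any TextReader, so it works with a StreamReader opened on the file as well as with a StringReader.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, for illustration only.
static class LineNumberReader
{
    // Reads the input line by line, splits each line on spaces and parses
    // every non-empty chunk as an int.
    public static List<int> ReadInts(TextReader reader)
    {
        var nums = new List<int>();
        var separators = new[] { ' ' };
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // RemoveEmptyEntries skips the empty chunks produced by
            // repeated, leading or trailing spaces.
            foreach (string chunk in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            {
                nums.Add(int.Parse(chunk));
            }
        }
        return nums;
    }
}
```

A call such as LineNumberReader.ReadInts(new StreamReader("test")) would read the file from the question. Keep in mind that if the file contains no line breaks at all, ReadLine returns the whole content as one string, which is not practical for a file of a gigabyte or more.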
Vlad
  • actually i want to know how to read the all numbers from the file. – Vasilii Ruzov May 05 '12 at 19:22
  • @Vasilii: see the edited post: you need to read chars, not ints. – Vlad May 05 '12 at 19:24
  • ok, so there is no way to get actually INT from file? after I get the byte array i'll have to parse it? – Vasilii Ruzov May 05 '12 at 19:26
  • 1
    @Vasilii: sorry, better is bytes :) because char is 2 bytes long in C# – Vlad May 05 '12 at 19:26
  • @Vasilii: look, the file contains 2 bytes: 0x30 and 0x31. If you read the file by 2-byte units, you get the both bytes in 1 step. If you read the file by 1-byte units, you get the chars separately. – Vlad May 05 '12 at 19:28
  • @Vasilii: or do you want the string represented as an int, so you want to get the _value_ of 1? then it's a different story, you have to get the string "1" and pass it to `int.Parse` function. – Vlad May 05 '12 at 19:30
  • Vlad, i think you didn't understand my question. 01 is just an example of a file. I have something like: 12 342 4 32 1 3 52 26 ... and i need to create an int array from these numbers – Vasilii Ruzov May 05 '12 at 19:32
  • 2
    Then parsing is your easiest option. You don't have a file of ints, you have a file of characters which happen to be in the range '0' to '9' delimited by spaces. – Tony Hopkinson May 05 '12 at 19:38
  • Vlad, StreamReader is too slow. that's why i've asked about memory mapping. – Vasilii Ruzov May 05 '12 at 19:39
  • @Vasilii: is it really? what are your constraints? how large is the file? – Vlad May 05 '12 at 19:41
  • @Vlad: 1-10 GB. line-by-line reading is slow. i've read in 64 KB blocks but it's still slow. – Vasilii Ruzov May 05 '12 at 19:45
  • @Vasilii: memory-mapped files are not faster than plain files when it comes to reading from the disk. If you use the file for information exchange between your processes, you shouldn't write the ASCII representation into the mmf, just write binary numbers into it. – Vlad May 05 '12 at 19:45
  • input file is created not by me. so i have to work with space-separated numbers( – Vasilii Ruzov May 05 '12 at 19:48