
simple question:

How can I use a huge two-dimensional array in C#? What I want to do is the following:

int[] Nodes = new int[1146445];

int[,] Relations = new int[Nodes.Lenght,Nodes.Lenght];

Naturally, I got an out-of-memory error.

Is there a way to work with data this big in memory? (I have 4 GB of RAM and a 6-core CPU.)

The integers I want to store in the two-dimensional array are small, roughly in the range 0 to 1000.

Update: I tried to store the relations in a Dictionary<KeyValuePair<int, int>, int>. It works for a few iterations of the adding loop. Below is the class that should build the graph; the CreateGraph instance gets its data from an XML stream reader.

Main (C# backgroundWorker_DoWork)

ReadXML Reader = new ReadXML(tBOpenFile.Text);
CreateGraph Creater = new CreateGraph();

int WordsCount = (int)nUDLimit.Value;
if (nUDLimit.Value == 0) WordsCount = Reader.CountWords();

// word loop
for (int Position = 0; Position < WordsCount; Position++)
{
    // reading and parsing
    Reader.ReadNextWord();

    // add to graph builder
    Creater.AddWord(Reader.CurrentWord, Reader.GetRelations(Reader.CurrentText));
}

string[] Words = Creater.GetWords();
Dictionary<KeyValuePair<int, int>, int> Relations = Creater.GetRelations();

ReadXML

class ReadXML
{
    private string Path;
    private XmlReader Reader;
    protected int Word;
    public string CurrentWord;
    public string CurrentText;

    public ReadXML(string FilePath)
    {
        Path = FilePath;
        LoadFile();
        Word = 0;
    }

    public int CountWords()
    {
        // caching
        if (Path.Contains("filename")) return 1000;

        int Words = 0;
        while (Reader.Read())
        {
            if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "word")
            {
                Words++;
            }
        }

        LoadFile();

        return Words;
    }

    public void ReadNextWord()
    {
        while(Reader.Read())
        {
            if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "word")
            {
                while (Reader.Read())
                {
                    if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "name")
                    {
                        XElement Title = XElement.ReadFrom(Reader) as XElement;
                        CurrentWord = Title.Value;

                        break;
                    }
                }
                while(Reader.Read())
                {
                    if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "rels")
                    {
                        XElement Text = XElement.ReadFrom(Reader) as XElement;
                        CurrentText = Text.Value;

                        break;
                    }
                }
                break;
            }
        }
    }

    public Dictionary<string, int> GetRelations(string Text)
    {
        Dictionary<string, int> Relations = new Dictionary<string,int>();

        string[] RelationStrings = Text.Split(';');

        foreach (string RelationString in RelationStrings)
        {
            string[] SplitString = RelationString.Split(':');

            if (SplitString.Length == 2)
            {
                string RelationName = SplitString[0];
                int RelationWeight = Convert.ToInt32(SplitString[1]);

                Relations.Add(RelationName, RelationWeight);
            }
        }

        return Relations;
    }

    private void LoadFile()
    {
        Reader = XmlReader.Create(Path);
        Reader.MoveToContent();
    }
}

CreateGraph

class CreateGraph
{
    private Dictionary<string, int> CollectedWords = new Dictionary<string, int>();
    private Dictionary<KeyValuePair<int, int>, int> CollectedRelations = new Dictionary<KeyValuePair<int, int>, int>();

    public void AddWord(string Word, Dictionary<string, int> Relations)
    {
        int SourceNode = GetIdCreate(Word);
        foreach (KeyValuePair<string, int> Relation in Relations)
        {
            int TargetNode = GetIdCreate(Relation.Key);
            CollectedRelations.Add(new KeyValuePair<int,int>(SourceNode, TargetNode), Relation.Value);  // the "same key" exception is thrown here
        }
    }

    public string[] GetWords()
    {
        string[] Words = new string[CollectedWords.Count];

        foreach (KeyValuePair<string, int> CollectedWord in CollectedWords)
        {
            Words[CollectedWord.Value] = CollectedWord.Key;
        }

        return Words;
    }

    public Dictionary<KeyValuePair<int,int>,int> GetRelations()
    {
        return CollectedRelations;
    }

    private int WordsIndex = 0;
    private int GetIdCreate(string Word)
    {
        if (!CollectedWords.ContainsKey(Word))
        {
            CollectedWords.Add(Word, WordsIndex);
            WordsIndex++;
        }
        return CollectedWords[Word];
    }

}

Now I get another error: "An element with the same key already exists." (At the Add call in the CreateGraph class.)

danijar
    Are you *sure* it's only 10,000? That's only allocating 400MB, which even my netbook can handle... – Jon Skeet Mar 05 '12 at 17:45
  • Unfortunately, it is exactly 1,146,445. – danijar Mar 05 '12 at 17:49
  • Why would you show us `10000` and not what you *actually* have (`1146445`)? How about just copying and pasting your actual code? (`.lenght` doesn't compile.) – Marlon Mar 05 '12 at 17:50
  • @sharethis: So your real code is trying to allocate something over 100x as large as what you've shown? No wonder it doesn't work. – Jon Skeet Mar 05 '12 at 17:51
  • My actual code is more complicated. The array size is generated at runtime, so I couldn't show it in the code. I used the debugger to get the number 1146445. – danijar Mar 05 '12 at 17:57
  • My mistake: not 100x as large, 10,000x as large. – Jon Skeet Mar 05 '12 at 17:58
  • @sharethis: Did you really expect to be able to allocate an array with ~5TB of data in? – Jon Skeet Mar 05 '12 at 18:00
  • Are there lots of zeros (don't cares) in Relations? – H H Mar 05 '12 at 18:01
  • Since your integers only range from 0 to 1000, you could use `short` rather than `int`. That would reduce your memory needs by 50%. That wouldn't be enough to overcome this problem, however. – phoog Mar 05 '12 at 18:02
  • @HenkHolterman: Yes, there are. Is there a better solution then? – danijar Mar 05 '12 at 18:03
  • sharethis: try to write down what you actually need. You're half-way into a solution that isn't going to work. – H H Mar 05 '12 at 18:05
  • I want to simulate a "brain". That's why I have to calculate with a huge graph in-memory. – danijar Mar 05 '12 at 18:10

3 Answers


You'll have a better chance if you set Relations up as a jagged array (an array of arrays):

//int[,] Relations = new int[Nodes.Length, Nodes.Length];
int[][] Relations = new int[Nodes.Length][];
for (int i = 0; i < Relations.Length; i++)
    Relations[i] = new int[Nodes.Length];

And then you still need 10k * 10k * sizeof(int) = 400 MB, which should be possible even when running as a 32-bit process.

Update:

With the new number it's 1M * 1M * 4 bytes = 4 TB; that's not going to work. And using short instead of int would only bring it down to 2 TB.


Since you seem to need to assign weights to (sparse) connections between nodes, you should see if something like this could work:

struct WeightedRelation
{
   public readonly int node1;
   public readonly int node2;
   public readonly int weight;

   public WeightedRelation(int node1, int node2, int weight)
   {
       this.node1 = node1;
       this.node2 = node2;
       this.weight = weight;
   }
}

int[] Nodes = new int[1146445];

List<WeightedRelation> Relations = new List<WeightedRelation>();
Relations.Add(new WeightedRelation(1, 2, 10));
...

This is just the basic idea; you may need a double dictionary to do fast lookups. But your memory use would be proportional to the number of actual (non-zero) relations.
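A "double dictionary" here could mean a nested Dictionary<int, Dictionary<int, int>>, keyed first by source node and then by target node. A minimal sketch, assuming directed relations with int weights (the class and method names are mine, not from the original post):

```csharp
using System;
using System.Collections.Generic;

// Sparse, directed, weighted adjacency structure. The outer key is the
// source node id, the inner key the target node id, the value the weight.
// Memory grows with the number of non-zero relations only.
class SparseRelations
{
    private readonly Dictionary<int, Dictionary<int, int>> rows =
        new Dictionary<int, Dictionary<int, int>>();

    public void Set(int source, int target, int weight)
    {
        Dictionary<int, int> row;
        if (!rows.TryGetValue(source, out row))
        {
            row = new Dictionary<int, int>();
            rows[source] = row;
        }
        row[target] = weight;   // the indexer overwrites instead of throwing
    }

    public int Get(int source, int target)
    {
        Dictionary<int, int> row;
        int weight;
        if (rows.TryGetValue(source, out row) &&
            row.TryGetValue(target, out weight))
            return weight;
        return 0;               // an absent relation reads as 0
    }
}
```

Reading an absent relation returns 0, and writing through the indexer silently overwrites, which would also avoid the "element with the same key already exists" exception from Dictionary.Add.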

H H
  • Quite the reverse - if you're going to allocate every "row", a jagged array takes up more memory than a rectangular array, because of the overhead of each separate array object. Your code allocates 10,001 array objects vs. the 1 in the original code (for `Relations`). – Jon Skeet Mar 05 '12 at 17:46
  • @JonSkeet - Yes, it does take up (a little) more space but there is a much better chance of finding that space in the LOH. – H H Mar 05 '12 at 17:50
  • ... although there is one benefit of this approach: it doesn't require as much *contiguous* virtual memory, and won't run into the problem of very large objects as early. Maybe that's what you meant? (Even the existing code is fine on my 32-bit netbook, btw.) – Jon Skeet Mar 05 '12 at 17:50
  • @JonSkeet "Even the existing code is fine" - in an app that just started, of course. But allocate + deallocate some big stuff of varying sizes first. – H H Mar 05 '12 at 17:54
  • @JonSkeet: "Even the existing code is fine on my 32-bit netbook" - are you saying you managed to allocate ~ 4*1146445^2 bytes of memory on your netbook? – Igor Korkhov Mar 05 '12 at 17:55
  • @IgorKorkhov: When I wrote that comment, my screen was still showing the original code with only 10,000 elements in `Nodes`. – Jon Skeet Mar 05 '12 at 17:56
  • @IgorKorkhov The problem is that the OP thought it would be OK to show us a different number of elements that was actually allocated (in a question that has to do with insufficient memory, go figure). Keep in mind that the original post was `new int[10000]`, yet the REAL code is `new int[1146445]` – Marlon Mar 05 '12 at 17:58
  • I am sorry. I posted a wrong array size at first. That's why I wondered why I got out-of-memory errors: I thought it was only about 10,000. – danijar Mar 05 '12 at 18:05
  • @JonSkeet, Marlon: I see, but I was hoping to update Jon Skeet Facts (http://meta.stackexchange.com/a/9182/167729) with the new one - "Jon Skeet easily allocates 5 terabytes on his 32-bit netbook" - when the OP spoiled the party :) – Igor Korkhov Mar 05 '12 at 18:06
  • Thanks, can you give me an example of "a double dictionary", please? – danijar Mar 05 '12 at 18:22

Okay, now we know what you're really trying to do...

int[] Nodes = new int[1146445];

int[,] Relations = new int[Nodes.Length, Nodes.Length];

You're trying to allocate a single object which has 1,314,336,138,025 elements, each of size 4 bytes. That's over 5,000 GB. How exactly did you expect that to work?

Whatever you do, you're obviously going to run out of physical memory for that many elements... even if the CLR let you allocate a single object of that size.

Let's take a smaller example of 50,000 nodes, where you'd end up with ~9 GB of required space. I can't remember what the current limit is (it depends on the CLR version and whether you're using the 32- or 64-bit CLR), but I don't think any of them will support that.

You can break your array up into "rows" as shown in Henk's answer - that will take up more memory in total, but each array will be small enough to cope with on its own in the CLR. It's not going to help you fit the whole thing into memory though - at best you'll end up swapping to oblivion.

Can you use sparse arrays instead, where you only allocate space for elements you really need to access (or some approximation of that)? Or map the data to disk? If you give us more context, we may be able to come up with a solution.
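Mapping the data to disk could, for instance, use System.IO.MemoryMappedFiles (available since .NET 4). A rough sketch; the file name is made up, and the node count is deliberately reduced for illustration, since a full 1146445² matrix would still need terabytes of backing storage:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class RelationsOnDisk
{
    static void Main()
    {
        const long n = 1000;                  // illustrative node count
        long capacity = n * n * sizeof(int);  // a 4 MB backing file here

        // Back the "matrix" with a file instead of managed heap memory.
        using (var mmf = MemoryMappedFile.CreateFromFile(
            "relations.bin", FileMode.Create, null, capacity))
        using (var accessor = mmf.CreateViewAccessor())
        {
            // Row-major byte offset of element [123, 456].
            long offset = (123 * n + 456) * sizeof(int);
            accessor.Write(offset, 42);
            Console.WriteLine(accessor.ReadInt32(offset));   // prints 42
        }
    }
}
```

The OS pages only the touched regions into RAM, so access stays feasible even when the file is much larger than physical memory, though scattered random access will be disk-bound.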

Jon Skeet
  • Many elements of the array are `0`. (My input is only about 1 GB.) But I have to access all of them. What would happen if I tried to access an empty element of a `sparse array`? – danijar Mar 05 '12 at 18:01
  • @sharethis: it depends on how you implement it, which will depend on your real requirements. – Jon Skeet Mar 05 '12 at 18:07
  • An int uses 4 bytes of RAM no matter what value it holds. – f2lollpll Mar 05 '12 at 18:08
  • Is there a way to use my own datatypes so that I don't waste RAM? – danijar Mar 05 '12 at 18:11
  • @sharethis: Yes, but we still don't have enough information to suggest a particular implementation. Do you know *anything* about how many values you'll have, what patterns they're likely to form? – Jon Skeet Mar 05 '12 at 18:29
  • @JonSkeet: I can't say how many relations I have. The relations are stored in a 1 GB XML-File. Maybe this helps a bit. (Counting all words of all German Wikipedia articles would give you the exact number.) My actual code doesn't finish so I can't tell you the number now. – danijar Mar 05 '12 at 19:06
  • @sharethis: It doesn't really. Why don't you write the code to extract the values, and then you'll be able to tell us something about it. For example, whether there are relatively few rows, each with many populated columns, or vice versa, or nothing like that at all. – Jon Skeet Mar 05 '12 at 19:17
  • OK, I will update my question. It's going to become big, too :-) – danijar Mar 05 '12 at 19:20

Jon and Henk have alluded to sparse arrays; this would be useful if many of your nodes are unrelated to each other. Even if all nodes are related to all others, you may not need an n by n array.

For example, perhaps nodes cannot be related to themselves. Perhaps, given nodes x and y, "x is related to y" is the same as "y is related to x". If both of those are true, then for 4 nodes, you only have 6 relations, not 16:

a <-> b
a <-> c
a <-> d
b <-> c
b <-> d
c <-> d

In this case, an n-by-n array is wasting somewhat more than half of its space. If large numbers of nodes are unrelated to each other, you're wasting that much more than half of the space.

One quick way to implement this would be as a Dictionary<KeyType, RelationType>, where the key uniquely identifies the two nodes being related. Depending on your exact needs, this could take one of several different forms. Here's an example based on the nodes and relations defined above:

Dictionary<KeyType, RelationType> x = new Dictionary<KeyType, RelationType>();
x.Add(new KeyType(a, b), new RelationType(a, b));
x.Add(new KeyType(a, c), new RelationType(a, c));
... etc.

If relations are symmetric, then KeyType should ensure that new KeyType(b, a) creates an object that compares equal to (and hashes the same as) the one created by new KeyType(a, b).
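One way KeyType could normalize argument order is in its constructor, basing equality and hashing on the normalized pair. A sketch (the struct name and field names are mine, not from the post):

```csharp
using System;

// Order-normalizing key for symmetric relations: NodePair(a, b) and
// NodePair(b, a) compare equal and produce the same hash code.
struct NodePair : IEquatable<NodePair>
{
    public readonly int Low;
    public readonly int High;

    public NodePair(int x, int y)
    {
        // Store the smaller id first so that argument order is irrelevant.
        Low = Math.Min(x, y);
        High = Math.Max(x, y);
    }

    public bool Equals(NodePair other)
    {
        return Low == other.Low && High == other.High;
    }

    public override bool Equals(object obj)
    {
        return obj is NodePair && Equals((NodePair)obj);
    }

    public override int GetHashCode()
    {
        return Low * 397 ^ High;
    }
}
```

Implementing IEquatable<NodePair> on the struct also lets Dictionary look keys up without boxing.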

phoog
  • Thanks for your answer. Nodes may relate to themselves. It is important for my simulation. And the relations have to be directed so `a->b != b->a`... – danijar Mar 05 '12 at 18:30
  • @sharethis Oops, so much for that idea! Still, a sparse array approach will be helpful if there are large numbers of nodes that are not related to one another. If there are *not*, then you'll most likely need to persist the relation information somewhere other than in memory, such as a file or a database table. – phoog Mar 05 '12 at 18:41