7

Today I noticed that C#'s String class returns the length of a string as an int. Since an int is always 32 bits, no matter what the architecture, does this mean that a string can only be 2GB or less in length?

A 2GB string would be very unusual, and would present many problems along with it. However, most .NET APIs seem to use 'int' to convey values such as length and count. Does this mean we are forever limited to collection sizes which fit in 32 bits?

Seems like a fundamental problem with the .NET APIs. I would have expected things like count and length to be returned via the equivalent of 'size_t'.

Andrew
  • 107
  • 1
  • 4
  • 21
    If my answer was a 2GB string, I might take another look at the problem. – Anthony Pegram Jun 24 '10 at 02:54
  • 2
    Nitpick: since .NET encodes characters with UTF-16, allocating (at least) two bytes for each character, a string of maximum length would have 2^31 characters, and consume at least **4GB** of memory, not **2GB**. – Michael Petrotta Jun 24 '10 at 03:04
  • @Michael -- An int is signed, meaning the maximum length is 2GB. – Andrew Jun 24 '10 at 03:06
  • 1
    @Andrew: you are partially correct. The maximum length of a string is 2^31 *characters*, but as I discuss, that string will consume at least 2^32 *bytes*. "GB" is a unit of bytes, not characters. – Michael Petrotta Jun 24 '10 at 03:08
  • I don't quite see why this was downvoted to oblivion. I think it's a perfectly reasonable question. – Earlz Jun 24 '10 at 03:08
  • what sort of data manipulation can you hope to do if you could hold a 2+ GB strings ? – Egon Jun 24 '10 at 03:08
  • 4
    @Egon, a concatenated string of all my Facebook friends. – Anthony Pegram Jun 24 '10 at 03:10
  • We don't need to get too caught up with strings here. I was just using the string example to get people's attention. This 'int' limitation applies to most .NET APIs -- they tend to return things like length / count as type int. – Andrew Jun 24 '10 at 03:19
  • @Michael -- sorry, I misunderstood. You are correct, the string would consume 4GB :) – Andrew Jun 24 '10 at 03:19
  • 2
    @Andrew: then you should change your question to reflect that. Your question as written entirely concerns strings, and really isn't reasonable. Talking about other objects in the framework - that would make much more sense. – Michael Petrotta Jun 24 '10 at 03:21
  • @Andrew: Also check a post with some rationale for using signed 32 bit ints as indexers: http://stackoverflow.com/questions/3060057/unsigned-versus-signed-numbers-as-indexes – simendsjo Jun 25 '10 at 06:53

8 Answers

16

Seems like a fundamental problem with the .NET APIs...

I don't know if I'd go that far.

Consider almost any collection class in .NET. Chances are it has a Count property that returns an int. So this suggests the class is bounded at a size of int.MaxValue (2147483647). That's not really a problem; it's a limitation -- and a perfectly reasonable one, in the vast majority of scenarios.

Anyway, what would the alternative be? There's uint -- but that's not CLS-compliant. Then there's long...

What if Length returned a long?

  1. An additional 32 bits of memory would be required anywhere you wanted to know the length of a string.
  2. The benefit would be: we could have strings taking up billions of gigabytes of RAM. Hooray.

Try to imagine the mind-boggling cost of some code like this:

// Lord knows how many characters
string ulysses = GetUlyssesText();

// allocate an entirely new string of roughly equivalent size
string schmulysses = ulysses.Replace("Ulysses", "Schmulysses");

Basically, if you're thinking of string as a data structure meant to store an unlimited quantity of text, you've got unrealistic expectations. When it comes to objects of this size, it becomes questionable whether you have any need to hold them in memory at all (as opposed to hard disk).

Dan Tao
  • 125,917
  • 54
  • 300
  • 447
  • 4
    I don't see how it's reasonable. Since .NET defines an int to be 32 bits, that means 50 years from now...no matter what my computer can handle, .NET will be restricting me to 32-bit size collections. Sounds like a modern variation of '640Kb is enough for anyone'. – Andrew Jun 24 '10 at 03:05
  • 6
    @Andrew, in 50 years, you won't be programming in .NET. And in 50 years, int.MaxValue would still be a large number of objects to hold in a collection. – Anthony Pegram Jun 24 '10 at 03:07
  • 2
    @Andrew then create a wrapper around a multidimensional `List<>/Array` and have it return a `Int64` for `Count` – Earlz Jun 24 '10 at 03:08
  • 1
    Seems like a stupid arbitrary limitation. C handles this much better. – Andrew Jun 24 '10 at 03:10
  • 4
    The problem with "640Kb" is that it was obsolete in a very short time. In contrast, 50 years is a very long time in this industry. Vast majority of languages and technologies in use today did not exist 50 years ago, and most technologies in use back then did not survive to see this day (indeed, C, ancient as it is among its peers today, is only 38 years old). I don't think .NET string length limits will be a concern by that time. – Pavel Minaev Jun 24 '10 at 03:10
  • @Dan Tao's edit: This isn't true. C handles these scenarios very well with the 'size_t' type. – Andrew Jun 24 '10 at 03:21
  • 1
    @Andrew: You have to evaluate this particular fact in the context of the CLS as a whole, though. Maybe in 50 years it will seem absurd to cap strings at ~2 billion characters because we'll be absolutely swimming in memory; I don't know. But what seems far *more* relevant is whether or not 2 billion (or even 9 quintillion) will seem a reasonable cap on an integral data type. If those limits are no longer practical, then the CLS as it exists today will not be around anymore. – Dan Tao Jun 24 '10 at 03:25
  • It will be far less than 50 years before this assumption is obsolete. My computer's RAM grew more than 5 orders of magnitude in the past 20 years. I remember when I couldn't imagine how I'd ever use 64MB of RAM, yet today I don't think twice about loading a mere 64MB text file into a string for processing. – Ken Jun 24 '10 at 05:11
  • @Ken Yes, that's true. But let me compare the history of total process memory available and how much some high consuming applications used. When you had 64MB memory, you used 60MB of it, when you had 512MB memory, you used 400MB of it, when you had 2GB (32-bit), you used 1.5GB, now you have 8GB you use maybe 3GB. At some point your requirement for memory is going to end. And if you need to handle a string larger than 2GB, you'll most likely use a buffer with indexes instead of `System.String`. Or even more likely a lower level language. – Aidiakapi Jan 21 '13 at 14:52
5

Correct, the maximum length would be Int32.MaxValue; however, you'll likely run into other memory issues if you're dealing with strings anywhere near that size anyway.

Evan Trimboli
  • 29,900
  • 6
  • 45
  • 66
  • This applies to more than string though. It applies to almost all collections. – Andrew Jun 24 '10 at 03:04
  • 1
    @Andrew - The answer covers that statement too. If you have a collection approaching 2 GB you are going to have other issues as well. – David Basarab Jun 24 '10 at 03:05
  • Suppose it's the year 2060 and I'm working on an application on my ultra-modern PC which requires collections with more than an int's worth of items. What problems might I have? – Andrew Jun 24 '10 at 03:09
  • 4
    @Andrew, first of all using .NET in 2060 is a problem. – Joseph Yaduvanshi Jun 24 '10 at 03:12
  • 3
    @Jim Schubert, I bet someone said the same thing about using COBOL in 2010 :) – Giovanni Galbo Jun 24 '10 at 03:25
  • @Giovanni: by 2060, I hope IT managers will have learned from their mistakes. Dijkstra knew it in the '70s: "The use of COBOL cripples the mind; its teaching should, therefore, be regarded as a criminal offense." I'm sure COBOL will still be used in 2060, since most IT departments are slower to make decisions than Congress. – Joseph Yaduvanshi Jun 24 '10 at 12:34
  • @Jim Schubert -- Dijkstra was almost a cartoon example of a "Computer Scientist". COBOL was a great language for getting the sort of things computers did in the 1970s done. Today's CS professors hate PHP but pragmatic programmers know there is no quicker way to implement a web server. – James Anderson Jan 14 '16 at 08:28
  • 1
    @JamesAnderson I've never met a pragmatic programmer that would choose PHP over any other option (including static HTML). We must run in different circles. I've done PHP professionally and hated every second of it. – Joseph Yaduvanshi Jan 14 '16 at 18:57
2

At some value of String.Length, probably about 5MB, it's not really practical to use String anymore. String is optimised for short bits of text.

Think about what happens when you do

myString += " more chars"

Something like:

  1. The system calculates the length of myString plus the length of " more chars".
  2. The system allocates that amount of memory.
  3. The system copies myString to the new memory location.
  4. The system copies " more chars" to the new memory location, after the last copied myString character.
  5. The original myString is left to the mercy of the garbage collector.

While this is nice and neat for small bits of text, it's a nightmare for large strings; just finding 2GB of contiguous memory is probably a showstopper.

So if you know you are handling more than a very few MB of characters, use the StringBuilder class instead, as sketched below.
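
Here's a minimal sketch of the StringBuilder approach (the loop and counts are illustrative, not from the original answer). Appends mutate an internal buffer, so each step no longer allocates and copies a whole new string:

using System.Text;

var sb = new StringBuilder();
for (int i = 0; i < 100000; i++)
{
    sb.Append(" more chars"); // amortized O(1); no intermediate strings
}
string result = sb.ToString(); // one final allocation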

James Anderson
  • 27,109
  • 7
  • 50
  • 78
1

In versions of .NET prior to 4.5, the maximum object size is 2GB. From 4.5 onwards you can allocate larger objects if gcAllowVeryLargeObjects is enabled. Note that the limit for string is not affected, but "arrays" should cover "lists" too, since lists are backed by arrays.
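
For reference, the switch lives in app.config (a minimal sketch; gcAllowVeryLargeObjects is the real element name, the surrounding file is illustrative):

<configuration>
  <runtime>
    <!-- .NET 4.5+, 64-bit processes only: allow arrays larger than 2GB -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>

Even with this enabled, a single array dimension is still capped at roughly 2^31 elements; the flag lifts the total byte-size limit, not the element-count limit.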

Marc Gravell
  • 1,026,079
  • 266
  • 2,566
  • 2,900
1

It's pretty unlikely that you'll need to store more than two billion objects in a single collection. You're going to incur some pretty serious performance penalties when doing enumerations and lookups, which are the two primary purposes of collections. If you're dealing with a data set that large, there is almost assuredly some other route you can take, such as splitting up your single collection into many smaller collections that contain portions of the entire set of data you're working with.

Heeeey, wait a sec.... we already have this concept -- it's called a dictionary!

If you need to store, say, 5 billion English strings, use this type:

Dictionary<string, List<string>> bigStringContainer;

Let's make the key string represent, say, the first two characters of the string. Then write an extension method like this:

public static string BigStringIndex(this string s)
{
    // Note: assumes s has at least two characters.
    return String.Concat(s[0], s[1]);
}

and then add items to bigStringContainer like this:

bigStringContainer[item.BigStringIndex()].Add(item); // assumes the bucket for this index already exists (see below)

and call it a day. (There are obviously more efficient ways you could do that, but this is just an example)
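
One wrinkle the snippet above glosses over: indexing into the dictionary throws if the bucket doesn't exist yet. A minimal sketch of creating buckets on demand (hypothetical glue code, not part of the original answer):

string key = item.BigStringIndex();
List<string> bucket;
if (!bigStringContainer.TryGetValue(key, out bucket))
{
    bucket = new List<string>();        // first string with this prefix
    bigStringContainer[key] = bucket;
}
bucket.Add(item);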

Oh, and if you really really really do need to be able to look up any arbitrary object by absolute index, use an Array instead of a collection. Okay yeah, you lose some type safety, but you can index array elements with a long.
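
For what it's worth, the long indexer really does compile (a trivial sketch; the array and index are made up):

byte[] data = new byte[100];
long i = 42L;     // a 64-bit index
byte b = data[i]; // legal C#; the CLR range-checks at runtime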

Warren Rumak
  • 3,824
  • 22
  • 30
  • Even if you could index into an array with a `long` it would currently be pretty useless: The CLR has a max object size limit of 2GB, so it's impossible for an array to have more than `int.MaxValue` elements anyway (and it could only get near that limit if it was a `bool[]` or `byte[]` array with single-byte elements). *This restriction applies to Microsoft's current implementation, I'm not sure about Mono.* – LukeH Jun 25 '10 at 06:39
1

The fact that the framework uses Int32 for Count/Length properties, indexers etc is a bit of a red herring. The real problem is that the CLR currently has a max object size restriction of 2GB.

So a string -- or any other single object -- can never be larger than 2GB.

Changing the Length property of the string type to return long, ulong or even BigInteger would be pointless since you could never have more than approx 2^30 characters anyway (2GB max size and 2 bytes per character).

Similarly, because of the 2GB limit, the only arrays that could even approach having 2^31 elements would be bool[] or byte[] arrays that only use 1 byte per element.

Of course, there's nothing to stop you creating your own composite types to work around the 2GB restriction.
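
Here's a minimal sketch of one such composite type (a hypothetical illustration, in the spirit of the "BigArray" workaround linked in the comments below): a chunked array that spreads its elements across many smaller arrays, so no single object approaches the 2GB limit.

using System;

public class BigArray<T>
{
    private const int ChunkSize = 1 << 20; // ~1M elements per chunk (arbitrary)
    private readonly T[][] chunks;

    public long Length { get; private set; }

    public BigArray(long length)
    {
        Length = length;
        int chunkCount = (int)((length + ChunkSize - 1) / ChunkSize);
        chunks = new T[chunkCount][];
        for (int i = 0; i < chunkCount; i++)
        {
            long remaining = length - (long)i * ChunkSize;
            chunks[i] = new T[Math.Min(remaining, ChunkSize)];
        }
    }

    // Index with a long; each access maps to (chunk, offset within chunk).
    public T this[long index]
    {
        get { return chunks[index / ChunkSize][index % ChunkSize]; }
        set { chunks[index / ChunkSize][index % ChunkSize] = value; }
    }
}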

(Note that the above observations apply to Microsoft's current implementation, and could very well change in future releases. I'm not sure whether Mono has similar limits.)

LukeH
  • 263,068
  • 57
  • 365
  • 409
  • do you have any references for this? – Russell Jun 25 '10 at 06:57
  • @Russell: "As with 32-bit Windows operating systems, there is a 2GB limit on the size of an object you can create while running a 64-bit managed application on a 64-bit Windows operating system." http://msdn.microsoft.com/en-us/library/ms241064.aspx – LukeH Jun 25 '10 at 08:47
  • 1
    @Russell: There's also an interesting blog article here, with an example of a workaround composite object: http://blogs.msdn.com/b/joshwil/archive/2005/08/10/450202.aspx – LukeH Jun 25 '10 at 08:48
  • @Russell: And a couple of interesting SO discussions: http://stackoverflow.com/questions/1087982/single-objects-still-limited-to-2-gb-in-size-in-clr-4-0 and http://stackoverflow.com/questions/573692/is-the-size-of-an-array-constrained-by-the-upper-limit-of-int-2147483647 – LukeH Jun 25 '10 at 08:49
0

Even in x64 versions of Windows I got hit by .NET limiting each object to 2GB.

2GB is pretty small for a medical image. 2GB is even small for a Visual Studio download image.

Windows programmer
  • 7,871
  • 1
  • 22
  • 23
  • 1
    This is my concern. It seems like most of the APIs .NET provides use an int for things like 'count' or 'length'. – Andrew Jun 24 '10 at 03:16
  • 1
    @Michael - I don't care so much about strings in particular, it was just an example to get people's attention. – Andrew Jun 24 '10 at 03:16
  • 1
    Seems like someone hit that problem with `Array` early on, since it has a 64-bit `LongLength` property. – devstuff Jun 24 '10 at 03:26
  • @devstuff: In Microsoft's implementation, `LongLength` just returns the 32-bit `Length` cast to a `long`! Besides, the CLR's 2GB object size restriction means that the only arrays that could get anywhere near having `int.MaxValue` elements would be `bool[]` or `byte[]`. (I'm not sure if Mono is subject to the same restrictions.) – LukeH Jun 24 '10 at 06:37
  • Even if one wanted to store a 65536x65536-pixel 64-bit color image (16GB), that wouldn't imply that one should store it as a single `ColorPixel[65536,65536]`. Subdividing into smaller objects would seem to make more sense. Further, even if we do reach the point where using a single monolithic data structure over 4GB would be more efficient than using a nested collection of sub-4GB objects, I don't know that we'll ever reach the point where subdividing things into objects below 4GB would impede performance significantly. – supercat Jul 10 '12 at 16:27
  • The OP doesn't suggest string is the correct type in which to store a Visual Studio download image. Stream might be better. – maxwellb Jan 13 '16 at 21:33
0

If you are working with a file that is 2GB, you're likely going to be using a lot of RAM and seeing very slow performance.

Instead, for very large files, consider using a MemoryMappedFile (see: http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx). Using this method, you can work with a file of nearly unlimited size, without having to load the whole thing in memory.
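
A minimal sketch of the memory-mapped approach ("huge.txt" is a hypothetical path): map the file, then read one 4KB window of it, without ever pulling the whole file into memory.

using System.IO.MemoryMappedFiles;
using System.Text;

using (var mmf = MemoryMappedFile.CreateFromFile("huge.txt"))
using (var accessor = mmf.CreateViewAccessor(0, 4096)) // map only the first 4KB
{
    var buffer = new byte[4096];
    accessor.ReadArray(0, buffer, 0, buffer.Length);
    string window = Encoding.UTF8.GetString(buffer);
    // process this window, then map the next offset and repeat
}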

Robert Seder
  • 1,390
  • 1
  • 9
  • 19