Does it make sense to "waste" 8 bytes per String instance for offset/count?

Question

Strings in Java support structural sharing for some methods like substring, which means that supposedly immutable data doesn't need to be copied (which (unexpectedly) keeps large char arrays alive which would have been GC'd otherwise.)

This feature is implemented with two fields offset and count which are set accordingly when a String is substringed in Java.

Considering that .NET doesn't do this and claims that "O(n) is O(1) if n does not grow large", would a slightly different design of Strings make sense which accommodates both requirements?

E. g. would it make sense to have a sealed, memory-efficient, general purpose version of String which doesn't have these superfluous fields and a subclass "SubString" which is only returned by substring methods and has the additional fields to avoid copying?

Rough sketch:

sealed class String {
  val codeunits: Array[Char] = ...
  def length = codeunits.length

  def substring: SubString = ...

  ...
}

final class SubString extends String {
  val offset: Int = ...
  override def length = codeunits.length - offset /* and so on */

  ...
}

The field is called count, not length, at least in the Oracle JVM. — Michael Borgwardt, Sep 16 '11 at 14:35
I think you want to limit tags to java/scala since C# as you mention does not have the thing you propose to work around/replace — sehe, Sep 16 '11 at 14:36
@sehe: I'm especially interested what experiences .NET devs have made with their choice, that's why I left the tag in. — soc, Sep 16 '11 at 14:38
In theory it should be possible for a sufficiently-smart JVM to automatically shrink the array behind your back, but it might not be practical or worthwhile. — Stuart Cook, Sep 16 '11 at 14:38
@soc - that's not the purpose of tagging. Tagging is used to classify the post as to topic, not to attract the attention of a particular audience. You'd need to have more more .NET content for the question to be tagged .NET (or C#), IMO so I've removed that tag. NOTE: a question that did try to compare Java/C#/Scala string implementations would have to be crafted carefully to avoid it becoming an unproductive discussion. I think it's best left as a Java-only question. — tvanfosson, Sep 16 '11 at 14:43
Ok, agreed. I added the tags "memory usage" and "performance" instead. — soc, Sep 16 '11 at 14:47
Are there any good benchmarks showing how each model stacks up under similar load in terms of memory and processor usage? Seems like that would be your starting place to see if there's enough difference to warrant a mixed model. It might also show where each model ought to be used. (Keep in mind Java currently allows both models, you just have to choose which one to use, and I'm not so sure it's wasting as many bytes as you think. Length is already required by C# strings and there's going to be a starting memory address for any string) — , Sep 16 '11 at 14:51
If I could figure out the reason why people vote to close, I could improve my question ... — soc, Sep 16 '11 at 14:54
If 8 bytes matter to you, chances are you either create far too many tiny strings or should be programming in C. — , Sep 16 '11 at 14:56
I am to reply to you shortly: having a subclass will not be a better option b/c virtual method (override) w/ more than a single call site cannot be inlined effectively. 2 call sites does inline effective, yet it will lead to code bloat (and +extra type check). In the end you'd like string operations like indexOf to be inlined into SSE instructions. — bestsss, Sep 16 '11 at 16:37

score 2 · Accepted Answer · answered Sep 16 '11 at 15:25

2

What you suggest could make the common case more efficient in terms of memory and cpu.

You may be interested to know the JVM can change this without a code change. The Sun/Oracle JVM currently uses a byte[] automagically when the characters fit into bytes without loss.

In any case its the sort of thing you would want the JVM to do for you transparently, like -XX:+UseCompressedStrings does.

answered Sep 16 '11 at 15:25

Peter Lawrey

525,659
79
751
1,130

Thanks, but isn't that something completely different? Having a byte[] instead of a char[] doesn't change the behavior in question ... – soc Sep 16 '11 at 15:52
It shows that a change for performance reasons is possible, has happened relatively recently, and even desirable provided its transparent to the Java application. – Peter Lawrey Sep 16 '11 at 15:58

Does it make sense to "waste" 8 bytes per String instance for offset/count?

1 Answers1