This may be a very basic question, but I am unable to find a proper answer to it. According to this post, 1 byte can hold 1 character, and according to this post, a string of 4 bytes can store 2^31 - 1 characters. I am confused: if 1 char = 1 byte, then a string of 4 bytes should hold only 4 characters. (I know I am wrong, but can anyone explain what is wrong in my math?)

Avinash Agrawal
  • It can hold a single value in that range, not that many values. – chrylis -cautiouslyoptimistic- Sep 20 '15 at 04:45
  • I am still confused. Can you explain with this example: if I say string str = "abcde", which is 5 characters, and 5 characters = 5 bytes, how is this valid? – Avinash Agrawal Sep 20 '15 at 04:51
  • *The latter post says no such thing about 'a string of 4 bytes'*. The size of 2^31-1 given is rather *the maximum length of any String* (in Java), in characters. A string of 4 characters (in Java a character is not 'a byte') is of course, not the same string as a string of 2^31-1 characters. – user2864740 Sep 20 '15 at 05:02

1 Answer

For the sake of discussion, let's assume 1 byte is 8 bits. Most systems follow this rule (though there are certainly systems where 1 byte is not 8 bits).

> According to This post it says 1 byte can hold 1 character

That link talks about strings in a MySQL database, though what it says applies to any system that supports 7-bit ASCII characters in general. In this regard, 1 byte = 1 character, yes.

8-bit characters, on the other hand, introduce more complexity. For ASCII characters, which only require 7 bits, 1 byte = 1 character. But for non-ASCII characters, 1 byte may or may not represent a full Unicode character, depending on the charset used to encode the string.

For example, € (Unicode codepoint U+20AC EURO SIGN) takes 1 byte when encoded in Windows-125X charsets (0x88 in Windows-1251, 0x80 in Windows-1252 through Windows-1258), but takes 3 bytes when encoded in UTF-8 (0xE2 0x82 0xAC), even though they are all 8-bit encodings. In comparison, UTF-16, which is a 16-bit encoding, encodes U+20AC using 2 bytes: 0xAC 0x20 in little-endian byte order, or 0x20 0xAC in big-endian byte order.
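As a quick check of the byte counts above, here is a minimal Python 3 sketch. Python is used only for illustration, the codec names are Python's standard ones, and `encoded_forms` is a hypothetical helper name, not something from the linked posts.

```python
def encoded_forms(ch):
    """Return the bytes produced for one character by several encodings."""
    codecs = ["cp1251", "cp1252", "utf-8", "utf-16-le", "utf-16-be"]
    return {name: ch.encode(name) for name in codecs}


if __name__ == "__main__":
    # U+20AC is the euro sign discussed in the text.
    for name, raw in encoded_forms("\u20ac").items():
        hex_bytes = " ".join(f"{b:02x}" for b in raw)
        print(f"{name:10} {len(raw)} byte(s): {hex_bytes}")
```

The same single character comes out as 1, 2, or 3 bytes depending only on which codec is asked to encode it.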

> according to this post it says if string is of 4 bytes it can store 2^31 -1 characters

That link talks about strings in Java, though what it says applies to any system that supports variable-length strings that use a 32-bit signed integer to represent the string's length.

The link does not say anything about a 4-byte string holding 2^31 - 1 characters. What it actually says is that a string can hold up to a maximum of 2^31 - 1 characters. That is the highest value of a 32-bit signed integer. In other words, the 4 bytes hold a single number, the string's length, not the characters themselves.
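To see why a 4-byte length field does not limit a string to 4 characters, consider this rough Python 3 sketch of a length-prefixed layout. It is an illustration under assumptions only, not how the JVM actually lays out a `String`; the names `pack_string`, `unpack_string`, and `MAX_LENGTH` are hypothetical.

```python
import struct

MAX_LENGTH = 2**31 - 1  # largest value a 32-bit signed integer can hold


def pack_string(text):
    """Store a string as a 4-byte signed length followed by its UTF-16 code units."""
    payload = text.encode("utf-16-le")
    count = len(payload) // 2  # number of 16-bit code units
    if count > MAX_LENGTH:
        raise ValueError("too long for a 32-bit signed length")
    return struct.pack("<i", count) + payload


def unpack_string(blob):
    """Read the 4-byte length first, then that many 16-bit code units."""
    (count,) = struct.unpack_from("<i", blob, 0)
    return blob[4:4 + count * 2].decode("utf-16-le")


if __name__ == "__main__":
    blob = pack_string("abcde")
    print(len(blob))            # 4 bytes of length plus 10 bytes of characters
    print(unpack_string(blob))
```

The 4-byte field only counts the characters; the characters themselves occupy separate storage that grows with the string.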

> I am confused if 1 char = 1 byte then string of 4 byte should hold only 4 character.

For a 7-bit ASCII string, or an 8-bit ANSI/UTF-8 string that encodes 4 Unicode codepoints using 4 bytes, yes.

You have to take the string's byte encoding into account to know what the bytes of the string actually represent.
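A small Python 3 sketch of that point, with `count_characters` as a hypothetical helper name: the same three bytes decode to one character or to three, depending on which encoding is assumed.

```python
def count_characters(raw, encoding):
    """Decode raw bytes with the given encoding and return (text, character count)."""
    text = raw.decode(encoding)
    return text, len(text)


if __name__ == "__main__":
    raw = b"\xe2\x82\xac"  # three bytes
    print(count_characters(raw, "utf-8"))   # one character: the euro sign
    print(count_characters(raw, "cp1252"))  # three unrelated characters
    # The same 5-character text needs 5 bytes in ASCII and 10 bytes in UTF-16.
    print(len("abcde".encode("ascii")), len("abcde".encode("utf-16-le")))
```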

Remy Lebeau
  • Ok, I am getting the idea that it depends on the type of encoding, but in everyday programming (e.g. Java, Python, Ruby) we store very long runs of characters in a string type, for example String str = "Stackoverflow rocks !!", which works without any compilation error. What type of encoding does it use? – Avinash Agrawal Sep 20 '15 at 05:11
  • Can you provide a link on what a string's byte encoding represents? Further, I feel that we should talk more about encoding fundamentals, as what we learn in schools and books is not applied to modern-day programming. – Avinash Agrawal Sep 20 '15 at 05:24
  • That depends on the particular programming language and compiler implementation. Some systems use 8bit ANSI/UTF8 strings (eg: C/C++ when using `char`-based strings), but most systems use 16bit UTF-16 strings nowadays (eg: C/C++ when using 16bit `wchar_t`- or `char16_t`- based strings, Java, C#, etc). Read the documentation for your chosen programming language(s). – Remy Lebeau Sep 20 '15 at 05:25
  • Python is an ugly beast when it comes to string handling. Its default encoding for its `str` type is ASCII, but it can also hold other 8bit encodings. It also has a `unicode` type for storing Unicode strings, using either UTF-16 or UTF-32 encoding, depending on how the Python interpreter is compiled. There are a lot of questions on StackOverflow related to people having troubles with string encodings in Python. – Remy Lebeau Sep 20 '15 at 05:27
  • "*what's string's byte encoding represents*" - well, first you need to define which programming language you want to know that about. Different languages store strings differently. – Remy Lebeau Sep 20 '15 at 05:28
  • [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) – Remy Lebeau Sep 20 '15 at 05:29
  • Hi Remy, I read about encodings, but that is about how characters are encoded. For example, the letter A is assigned some U+ codepoint, which is then used in translation or in maintaining font style. Here we are talking about how Java, for example, can store long strings in its String data type of 4 bytes. I am not sure how to relate encoding style to memory. – Avinash Agrawal Sep 20 '15 at 05:48
  • Further, if we talk about UTF-16, it uses 2 bytes per code point, which means the letter A needs 2 bytes of storage, which means the Java String type can only store 2 characters. I know I am wrong, but can you explain? – Avinash Agrawal Sep 20 '15 at 05:55
  • @all see this link as well http://stackoverflow.com/questions/15369117/what-is-the-maximum-amount-of-data-that-a-string-can-hold-in-java – Avinash Agrawal Sep 20 '15 at 05:59
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/90144/discussion-between-remy-lebeau-and-avinash-agrawal). – Remy Lebeau Sep 20 '15 at 06:14