3

If the file is a text file, and StreamReader can figure out the Encoding it uses, how can I find out how many characters it has without reading the whole file?

I'm reading 1 GB CSV files, and each takes at least 4 seconds to read with a StreamReader. File.ReadAllText().Length would cause a System.OutOfMemoryException.

I imagine that if I had the FileInfo(filename).Length and the Encoding, I could calculate the number of characters.

Jader Dias
  • It rather depends on the type of encoding: `Length` will give you the number of bytes, which in ASCII would be the number of chars, but for UTF/Unicode you can't know until it's decoded. – Jodrell May 23 '11 at 19:39

5 Answers

4

You can't. The reason is that some encodings (notably UTF-8) have variable character width: some characters take up only 1 byte (ASCII), many take up 2 bytes, and others take 3 or 4 bytes. Thus, without decoding the characters, it is impossible to know the length of the file under such an encoding.

Also, all characters in C# strings are represented as UTF-16, AFAIK, so you can estimate the memory requirements in bytes rather easily by multiplying the character count by 2 (and, vice versa, estimate the number of characters in UTF-16 data by halving its byte size). The estimate only drifts if you have a very weird text (i.e. you're using many characters from outside plane 0), because each of those takes two `char` values.
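
As a rough illustration of that UTF-16 arithmetic, here is a minimal sketch in Python (chosen only as a neutral scratchpad, not .NET code). `utf16_code_units` is a hypothetical helper; it computes the number of UTF-16 code units, which is the value a .NET `string.Length` reports.

```python
def utf16_code_units(text: str) -> int:
    """Number of UTF-16 code units needed for text (hypothetical helper)."""
    # "utf-16-le" writes no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2

plain = "a,b,c"           # plane 0 only: one code unit per character
astral = "a,\U0001F600"   # the last character lies outside plane 0

print(utf16_code_units(plain), len(plain))    # 5 5
print(utf16_code_units(astral), len(astral))  # 4 3
```

For the second string the code-unit count (4) exceeds the code-point count (3), which is the "weird text" case where halving the byte size stops matching the number of characters a reader would count.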

Now, a better question is - why do you need the character count? What is it that you're doing with the CSV file later, that you want to load it all up into the memory, and why would knowing its size help?

Amadan
  • +1. However, it should be possible to write a method that counts the UTF-8 encoded characters in a file faster than a StreamReader with Encoding.UTF8 reads the characters. – dtb May 23 '11 at 19:44
  • And what about estimates? Could the first lines of a file give me an estimate of the number of chars for the rest of the file? – Jader Dias May 23 '11 at 19:44
  • @dtb I bet you can't, believe me I tried http://stackoverflow.com/questions/6101367/how-to-count-lines-fast/6101612#6101612 – Jader Dias May 23 '11 at 19:45
  • 1
    @Jader Dias: If you just need an estimate and your file doesn't contain too many non-ASCII characters, you can just use FileInfo.Length. – dtb May 23 '11 at 19:46
  • @dtb but do all encodings use 1 byte for ASCII characters? – Jader Dias May 23 '11 at 19:47
  • @Jader Dias, yes, for characters in the range 0 to 127 all encodings use 1 byte. – Chris Haas May 23 '11 at 19:50
  • @Jader: Yes, all encodings currently in use map the first 128 chars in the same way (more or less), which is the definition of ASCII; and pretty much all encodings that were not ASCII-compatible (like EBCDIC) died off some decades ago (outside banks, I should add :D ), and were single-byte encodings, AFAIK. Ninja edit: Ugh, right, except for the UTF-16. :D – Amadan May 23 '11 at 19:51
  • 2
    @Chris Haas: No, not all. For example, UTF-16 uses two bytes for all characters (including those in the range U+0000 to U+007F) (four bytes in the case of surrogate pairs). UTF-32 uses four bytes for all characters. – dtb May 23 '11 at 19:52
  • @dtb: True, counting characters would be faster than reading them in, but marginally so. The biggest performance hit remains: you need to read the whole file, every single byte. – Amadan May 23 '11 at 19:52
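
The idea dtb raises in the comments (counting UTF-8 characters without decoding them) could look roughly like the sketch below. It is written in Python as a language-neutral illustration, `count_utf8_code_points` is a hypothetical name, and it assumes the input is valid UTF-8. It still reads every byte, as Amadan points out; it only skips building the characters.

```python
import io

# In UTF-8, bytes of the form 0b10xxxxxx (0x80..0xBF) continue a character;
# every other byte starts a new one.
CONTINUATION_BYTES = bytes(range(0x80, 0xC0))

def count_utf8_code_points(stream, chunk_size=1 << 20):
    """Count code points in a binary stream of valid UTF-8, without decoding."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Delete the continuation bytes; what remains is one byte per code point.
        total += len(chunk.translate(None, CONTINUATION_BYTES))
    return total

data = "héllo,日本\n".encode("utf-8")
print(len(data), count_utf8_code_points(io.BytesIO(data)))  # 14 9
```

Because only lead bytes are counted, a chunk boundary that splits a multi-byte sequence does no harm. Two caveats: a leading byte order mark would be counted as one code point, and the result is a code-point count, so a .NET string built from the same data would be longer by one `char` for every 4-byte sequence.
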
1

For ASCII, CP-437, CP-1252, ISO-8859-1, or code pages similar to these, the number of characters will be the number of bytes.

If the file is in UTF-16, then you cannot know the number of characters from the number of bytes, but it will likely be close to the number of bytes / 2. In any case, you can calculate exactly the amount of memory needed to hold the file in a .NET string, because it will be the size of the file (since .NET uses UTF-16 internally) plus a constant overhead. The Length of such a string will be the number of bytes divided by 2 (not counting a byte order mark, if the file has one).

If the file is in UTF-8 (or any other variable-width encoding), then the number of bytes could be up to several times the number of characters, or it could be exactly one byte per character. It just depends on the data.

If the file is in UTF-32 (which is extremely unlikely), then the number of characters will be exactly the length of the file in bytes divided by four. But even though this is the exact number of characters, it does not indicate the length of the .NET string created from this file, since that might involve the use of surrogate code points for characters in the high planes, so the answer still depends on what you intend to do with the information.
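
The cases above can be condensed into a small sketch that returns the smallest and largest possible character (code point) count for a given file size. It is written in Python as a neutral illustration; `char_count_bounds` and the encoding labels it accepts are hypothetical, and it assumes valid data with no byte order mark.

```python
def char_count_bounds(byte_length, encoding):
    """Return (min, max) code points a file of byte_length bytes can hold."""
    enc = encoding.lower()
    if enc in {"ascii", "cp437", "cp1252", "iso-8859-1"}:
        # Single-byte code pages: one character per byte, always.
        return byte_length, byte_length
    if enc == "utf-32":
        # Fixed width: four bytes per character.
        return byte_length // 4, byte_length // 4
    if enc == "utf-16":
        # 2 bytes per plane-0 character, 4 bytes per surrogate pair.
        return (byte_length + 3) // 4, byte_length // 2
    if enc == "utf-8":
        # 1 to 4 bytes per character.
        return (byte_length + 3) // 4, byte_length
    raise ValueError(f"no rule for encoding {encoding!r}")

print(char_count_bounds(1000, "cp1252"))  # (1000, 1000)
print(char_count_bounds(1000, "utf-8"))   # (250, 1000)
```

Only the fixed-width encodings give an exact answer; for UTF-8 and UTF-16 the file size alone narrows the count to a range.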

Jeffrey L Whitledge
  • How to detect the file encoding then? – Jader Dias May 23 '11 at 20:28
  • 1
    @Jader Dias - Unless the file begins with a Byte Order Mark, then there is no reliable way to detect the encoding. There are heuristics that can be used to guess, but that is a whole other big question. (And "how many characters are in the file?" is a meaningless question if you do not know the file's encoding. Even reading the file will not tell you, if you cannot correctly read the file.) – Jeffrey L Whitledge May 23 '11 at 20:35
  • But I bet `StreamReader` uses encoding detection, even if it is not reliable. – Jader Dias May 23 '11 at 20:37
  • 1
    @Jader Dias - If you trust it for that, then use it. StreamReader has a CurrentEncoding property. It will not require you to read the entire file. – Jeffrey L Whitledge May 23 '11 at 20:40
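
A byte order mark check only needs the first few bytes of the file. The sketch below shows that idea in Python; it is not how StreamReader is implemented, and `sniff_bom` is a hypothetical helper. It returns `None` when no mark is present, in which case only heuristics remain.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(first_bytes):
    """Return the encoding named by a leading byte order mark, or None."""
    for bom, name in BOMS:
        if first_bytes.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfid,name\n"))  # utf-8
print(sniff_bom(b"id,name\n"))              # None
```

One ambiguity remains even with this ordering: a UTF-16 LE file whose first character is U+0000 starts with the same four bytes as the UTF-32 LE mark.
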
0

I don't think it really can - some encodings encode characters with different numbers of bytes, so you'd really need to convert the bytes into characters to find the number of characters.

For example, in UTF-8, the characters from \u0000 to \u007F are represented in 1 byte only; those between \u0080 and \u07FF need 2 bytes, and so on.
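
Spelled out in full, the UTF-8 ranges look like the sketch below (Python, used here only as a neutral illustration; `utf8_width` is a hypothetical helper). The loop compares it against what an actual encoder produces.

```python
def utf8_width(code_point):
    """Bytes UTF-8 needs for one code point (surrogates excluded)."""
    if code_point <= 0x7F:
        return 1   # U+0000..U+007F (ASCII)
    if code_point <= 0x7FF:
        return 2   # U+0080..U+07FF
    if code_point <= 0xFFFF:
        return 3   # U+0800..U+FFFF
    return 4       # U+10000..U+10FFFF

for ch in ["A", "é", "日", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", utf8_width(ord(ch)), len(ch.encode("utf-8")))
```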

carlosfigueira
  • And what about estimates? Could the first lines of a file give me an estimate of the number of chars for the rest of the file? – Jader Dias May 23 '11 at 19:43
  • As long as you're satisfied with that estimate then go for it. If you get 90 characters out of 100 bytes then you can estimate 90% of your bytes will be characters. The more sample data you use the better your estimate will be. – Chris Haas May 23 '11 at 19:51
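
The sampling estimate described in the comments could be sketched as follows. This is Python used as a neutral illustration, `estimate_char_count` and its default sample size are assumptions made for the example, and the result is only as good as the sample is representative of the rest of the file.

```python
import os

def estimate_char_count(path, sample_bytes=64 * 1024, encoding="utf-8"):
    """Extrapolate a file's character count from its first sample_bytes bytes."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    if not sample:
        return 0
    # errors="ignore" drops a multi-byte sequence cut off at the sample's end.
    chars_in_sample = len(sample.decode(encoding, errors="ignore"))
    # Scale the sample's chars-per-byte ratio up to the whole file.
    return round(file_size * chars_in_sample / len(sample))
```

With 90 characters in a 100-byte sample, a 1,000,000-byte file would be estimated at 900,000 characters. A larger sample, or several samples taken at different offsets, would reduce the risk of the top of the file being unrepresentative.
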
0

For some encodings this works (ASCII, Windows-1252, IBM-850, etc.), but not for UTF-8 and UTF-7, since they encode some characters as 1 byte, some as 2 (and I believe some as even more than 2).

GvS
  • And how to estimate the number of characters? Could I get the size in both bytes and chars for the first 100 lines of the file and then calculate the approximate number of chars for the whole file? How to do that? – Jader Dias May 23 '11 at 19:42
  • 1
    Depends on the contents. Suppose you have a UTF-8 file with English text, then a translation in Japanese following it. English typically takes up 1 byte per char, Japanese typically 3 bytes per char. If you estimate by the top of the file, you'll get very wrong results. – Amadan May 23 '11 at 19:44
  • @Jader Dias, it's a good way to presize a buffer and minimize reallocations. You could get fancy and keep a running estimate to improve your resize. – Jodrell May 23 '11 at 19:44
  • @Amadan that's not my case, I have a pretty regular CSV file – Jader Dias May 23 '11 at 19:45
  • I don't know what's pretty regular for you. I live in Japan, so my scenario is not too inconceivable to me :) – Amadan May 23 '11 at 19:54
0

The problem with this is that if the file is UTF-8 encoded, each character can occupy between 1 and 4 bytes, so you have no way of 'calculating' the number of characters without processing the file in some way.

Other encoding methods may prove more fruitful.

cusimar9