
I need to read the first line from a stream to determine the file's encoding, and then re-read the stream with that encoding.

The following code does not work correctly:

var r = response.GetResponseStream();
var sr = new StreamReader(r);
string firstLine =  sr.ReadLine();
string encoding = GetEncodingFromFirstLine(firstLine);
string text = new StreamReader(r, Encoding.GetEncoding(encoding)).ReadToEnd();

The text variable doesn't contain the whole text. For some reason the first line and several lines after it are skipped.

I tried everything: closing the StreamReader, resetting it, calling a separate GetResponseStream... but nothing worked.

I can't request the response stream a second time: the file comes from the internet, and downloading it again would hurt performance.

Update

Here's what GetEncodingFromFirstLine() looks like:

public static string GetEncodingFromFirstLine(string line)
{
    int encodingIndex = line.IndexOf("encoding=");
    if (encodingIndex == -1)
    {
        return "utf-8";
    }
    return line.Substring(encodingIndex + "encoding=".Length).Replace("\"", "").Replace("'", "").Replace("?", "").Replace(">", "");
}

...

// true
Assert.AreEqual("windows-1251", GetEncodingFromFirstLine(@"<?xml version=""1.0"" encoding=""windows-1251""?>")); 

Update 2

I'm working with XML files, and the text variable is parsed as XML:

var feedItems = XElement.Parse(text);
Alex

3 Answers


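Well, you're asking it to detect the encoding... and that requires it to read data. It reads that data from the underlying stream, and because a StreamReader reads in buffered chunks it consumes more than just the first line. You're then creating another StreamReader around the same stream, which starts wherever the first one stopped.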

I suggest you:

  • Get the response stream
  • Retrieve all the data into a byte array (or MemoryStream)
  • Detect the encoding (which should be performed on bytes, not text - currently you're already assuming UTF-8 by creating a StreamReader)
  • Create a MemoryStream around the byte array, and a StreamReader around that
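
The steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: `EncodingSniffer`, `ReadAllBytes`, `DetectXmlEncoding` and `Decode` are hypothetical names, and the detection assumes the XML declaration is written in an ASCII-compatible encoding (so a UTF-16 file without a byte order mark would not be recognised). On .NET Core and later, code pages such as windows-1251 are only available after registering `CodePagesEncodingProvider`.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class EncodingSniffer
{
    // Hypothetical helper: copies the whole (non-seekable) stream into memory.
    public static byte[] ReadAllBytes(Stream input)
    {
        using (var buffer = new MemoryStream())
        {
            input.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Looks for encoding="..." inside the XML declaration in the first bytes.
    // Assumption: the declaration itself is ASCII-compatible.
    public static Encoding DetectXmlEncoding(byte[] data)
    {
        int prefixLength = Math.Min(data.Length, 256);
        // ISO-8859-1 maps every byte to exactly one char, so no information is lost.
        string prefix = Encoding.GetEncoding("iso-8859-1").GetString(data, 0, prefixLength);
        Match match = Regex.Match(
            prefix, "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");
        // Fall back to UTF-8 when no encoding is declared.
        return match.Success ? Encoding.GetEncoding(match.Groups[1].Value) : Encoding.UTF8;
    }

    // Decodes the buffered bytes with the detected encoding.
    public static string Decode(byte[] data)
    {
        Encoding encoding = DetectXmlEncoding(data);
        using (var reader = new StreamReader(new MemoryStream(data), encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With the `response` object from the question, usage might look like `string text = EncodingSniffer.Decode(EncodingSniffer.ReadAllBytes(response.GetResponseStream()));`.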

It's not clear what your GetEncodingFromFirstLine method does... or what this file really is. More information may make it easier to help you.

EDIT: If this is to load some XML, don't reinvent the wheel. Just give the stream to one of the existing XML-parsing classes, which will perform the appropriate detection for you.
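
As a minimal sketch of that suggestion, the response stream can be handed straight to LINQ to XML, which reads the encoding from the XML declaration (or byte order mark) itself. `FeedLoader` and `LoadFeed` are hypothetical names used only for illustration, and this assumes the document is well-formed XML with a correct declaration.

```csharp
using System.IO;
using System.Xml.Linq;

static class FeedLoader
{
    // Passes the raw bytes to the XML parser instead of decoding them first,
    // so the parser can honour the declared encoding.
    public static XElement LoadFeed(Stream responseStream)
    {
        return XElement.Load(responseStream);
    }
}
```

In the question's code this would replace both StreamReader instances and the call to `XElement.Parse(text)`, for example `var feedItems = FeedLoader.LoadFeed(response.GetResponseStream());`.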

Jon Skeet
  • @Skomski: Thanks - that makes it simpler, as the OP doesn't need to do the work at all. – Jon Skeet Aug 09 '11 at 12:16
  • Jon, thank you for your answer! I already tried detecting encoding on bytes, but it didn't work. I've updated the post with the GetEncodingFromFirstLine() method. – Alex Aug 09 '11 at 12:18
  • @Alex: You *must* detect the encoding based on the bytes, if you actually want to detect it at all. You will have lost information by creating a `StreamReader` around the bytes - because it will *impose* UTF-8 or UTF-16. However, see my edit - you don't need to do any of this at all. Just give the binary data to an XML parser. – Jon Skeet Aug 09 '11 at 12:19
  • you are right, I'm parsing XML. However when parsing xml files with Cyrillic characters I receive unreadable stuff, because those files use Windows cp1251 encoding. I need to detect the encoding and based on that encoding create the StreamReader. – Alex Aug 09 '11 at 12:29
  • @Alex: But the XML parser should be handling that for you, unless you're saying that the XML files are fundamentally *broken* to start with. The whole point of an XML file declaring its encoding is to allow the parser to hide all this from you. – Jon Skeet Aug 09 '11 at 12:31
  • That's interesting... When I parse the following XML file as is, I receive unreadable characters: http://feeds.feedburner.com/membrana_ru – Alex Aug 09 '11 at 12:39
  • @JonSkeet let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/2268/discussion-between-alex-and-jon-skeet) – Alex Aug 09 '11 at 12:39
  • @Alex: No, I don't have time to devote to a 1-to-1 chat session. I'm doing lots of things at the same time - comments work much better for me. – Jon Skeet Aug 09 '11 at 12:40
  • @Alex: If that *HTML* is actually valid *XML*, it certainly doesn't use an XML declaration to specify the character encoding. It's got a charset within the `meta` tag, but that's quite different. – Jon Skeet Aug 09 '11 at 12:41
  • When I parse the following XML file as is, I receive unreadable characters: http://feeds.feedburner.com/membrana_ru This is a screenshot of what I receive when I try to print the title of the first element to console: http://dl.dropbox.com/u/14148878/… And this is the correct title: http://dl.dropbox.com/u/14148878/… I specified the encoding when creating the stream reader: `var sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("windows-1251"));` – Alex Aug 09 '11 at 12:49
  • @Alex: I can't view either of those files. But does the web response not indicate the content type in the headers? – Jon Skeet Aug 09 '11 at 12:51

You need to change the current position in the stream to the beginning.

r.Position = 0;
string text = new StreamReader(r, Encoding.GetEncoding(encoding)).ReadToEnd();
Jakub Konecki
  • The stream comes from an HTTP response, so it's not seekable. And the OP wants to detect the encoding, whereas your code assumes the encoding is already known. – Thomas Levesque Aug 09 '11 at 12:18

I found the answer to my question here:

How can I read an Http response stream twice in C#?

Stream responseStream = CopyAndClose(resp.GetResponseStream());
// Do something with the stream
responseStream.Position = 0;
// Do something with the stream again


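// Copies the whole input into a seekable MemoryStream, then closes the input.
private static Stream CopyAndClose(Stream inputStream)
{
    const int readSize = 256;
    byte[] buffer = new byte[readSize];
    MemoryStream ms = new MemoryStream();

    // Read returns 0 once the end of the stream has been reached.
    int count = inputStream.Read(buffer, 0, readSize);
    while (count > 0)
    {
        ms.Write(buffer, 0, count);
        count = inputStream.Read(buffer, 0, readSize);
    }
    ms.Position = 0; // rewind so the caller starts reading at the beginning
    inputStream.Close();
    return ms;
}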
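
Applied to the code from the question, the two "do something" steps might look like the sketch below. `TwoPassReader` and `ReadWithDeclaredEncoding` are hypothetical names, `GetEncodingFromFirstLine` is copied from the question, and `Stream.CopyTo` stands in for the manual copy loop. It assumes the XML declaration sits on a line of its own and contains only ASCII characters.

```csharp
using System.IO;
using System.Text;

static class TwoPassReader
{
    // Copied from the question: extracts the name after encoding= on the first line.
    public static string GetEncodingFromFirstLine(string line)
    {
        int encodingIndex = line.IndexOf("encoding=");
        if (encodingIndex == -1)
        {
            return "utf-8";
        }
        return line.Substring(encodingIndex + "encoding=".Length)
                   .Replace("\"", "").Replace("'", "").Replace("?", "").Replace(">", "");
    }

    public static string ReadWithDeclaredEncoding(Stream source)
    {
        // Buffer everything once so the data can be read twice.
        var buffered = new MemoryStream();
        source.CopyTo(buffered);
        buffered.Position = 0;

        // Pass 1: read only the first line. The reader is deliberately not
        // disposed, because disposing it would also close the MemoryStream.
        string firstLine = new StreamReader(buffered, Encoding.ASCII).ReadLine() ?? "";
        string encodingName = GetEncodingFromFirstLine(firstLine);

        // Pass 2: rewind and decode the whole document with the declared encoding.
        buffered.Position = 0;
        using (var reader = new StreamReader(buffered, Encoding.GetEncoding(encodingName)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Rewinding works here only because the MemoryStream is seekable; the original HTTP response stream is not.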
Alex