How can I read a Lync conversation file containing HTML?

Question

I'm having trouble reading a local file into a string in C#.

Here's what I have so far:

 string file = @"C:\script_test\{5461EC8C-89E6-40D1-8525-774340083829}.html";
 using (StreamReader reader = new StreamReader(file))
 {
      string line = "";
      while ((line = reader.ReadLine()) != null)
      {
           textBox1.Text += line.ToString();
      }
 }

And it's the only solution that seems to work.

I've tried some other suggested methods for reading a file, such as:

string file = @"C:\script_test\{5461EC8C-89E6-40D1-8525-774340083829}.html";
string html = File.ReadAllText(file).ToString();
textBox1.Text += html;

Yet it does not work as expected.

Here are the first few lines of the file I'm trying to read:

As you can see, it has some funky characters. Honestly, I don't know if that's the cause of this weird behavior.

But in the first case, the code seems to skip those lines, printing only "Document generated by Office Communicator..."

Is that binary data? You can read it into a binary stream and convert it to a string. — Sin, Jul 27 '15 at 09:33
Please post the *binary* data from the start of the file - look at it with a hex file editor, basically. — Jon Skeet, Jul 27 '15 at 09:35
It looks like a simple html file, in fact it has an HTML tag, plus all other pieces as body, style, etc. When opened with chrome, it's a simple web page, with some garbage at the top. — user2340989, Jul 27 '15 at 09:35
Your 'html' file has some sort of binary header. The problem probably occurred earlier, when it was downloaded / generated. — H H, Jul 27 '15 at 09:36
How did you get this stream into the file? From Fiddler or something? The creator of the file might also have messed it up. — Patrick Hofman, Jul 27 '15 at 09:36
It's generated by Lync, office communicator. I have a whole bunch of them. All the same. — user2340989, Jul 27 '15 at 09:37
That definitely looks like binary data. `NUL` there means U+0000 which is very rare in text and very common in binary. Likewise `ACK` which is U+0006. I'd guess it was a binary stream that also contained text (no rule against binary streams containing some text; no rule against binary streams containing anything), and `NUL` can often be interpreted in a text context as "Stop reading here"! — Jon Hanna, Jul 27 '15 at 09:38
Here's the binary data from the file: http://i.imgur.com/NeKRA9e.jpg — user2340989, Jul 27 '15 at 09:49
So, is it this binary data that breaks the code? If so, can I handle it in some way? — user2340989, Jul 27 '15 at 09:52
Here's the file if someone wants to fiddle with it: http://filebin.ca/2A3sSDhsYKdz — user2340989, Jul 27 '15 at 09:58
This isn't a text file. It's some other format that was saved with the ".html" extension. How was it generated *exactly*? The "generated by Lync" isn't an answer - what did Lync generate? Did you try to save something as an attachment? Is it the recording of a session? A transferred file? A raw file found in the user's Lync data folder? If you don't know the type or format of a binary file, you can't process it — Panagiotis Kanavos, Jul 27 '15 at 10:27
@Panagiotis Kanavos, I don't know how it's generated. It's a set of files holding Lync's conversation history. It's generated by the software. So basically I have no idea about this file. What I know about it is that there's some HTML in it. :/ — user2340989, Jul 27 '15 at 10:46
That is the *most* significant piece of information, and it should be in the title itself. I suggest you post a *new* question that asks how to read Lync conversation history files. There may be an API that makes this trivial. Also check Lync's documentation and programming guides. Make sure you mention the Lync version used. Also note that there are a lot of SO questions about reading Lync's history either from the server or the client. Make sure you specify the appropriate case. — Panagiotis Kanavos, Jul 27 '15 at 11:04
The answer to _"How to read Lync transcripts generated by tool X"_ is _"By implementing a reader for the format that tool X writes them in"_, which is too broad. If the tool is open source, you may be able to reuse its code. — CodeCaster, Jul 27 '15 at 11:12
@Panagiotis Kanavos, OK, I'll follow your suggestion and ask another question. So far, I wasn't able to find a piece of software to read local .hist files; most of it was server based. I'm doing this because, for unknown reasons (it's not my job, and I don't have rights to it), Lync is not able to save history in Outlook. Now, since I don't have access to the server, I thought about writing some sort of parser to display current and future Lync history files on my PC, but... — user2340989, Jul 27 '15 at 12:28

rene · Answer 1 · 2015-07-27T17:35:20.860

Your task would be easier if you could use an API or SDK, or at least had a description of the format you are trying to read. However, the binary format does not look that complicated, and with a hex viewer I got far enough to extract the HTML from the example you provided.

To parse non-text files, fall back to BinaryReader and use its Read methods to read the correct type from the byte stream. I used ReadByte and ReadInt32. Note that the documentation for each method states how many bytes it reads, which comes in handy when you try to decipher your file.

    private string ParseHist(string file)
    {
        using (var f = File.Open(file, FileMode.Open))
        {
            using (var br = new BinaryReader(f))
            {
                // read 4 bytes as an int
                var first = br.ReadInt32();
                // read integer / zero ended byte arrays as string
                var lead = br.ReadInt32();
                // until we have 4 zero bytes
                while (lead != 0)
                {
                    var user = ParseString(br);
                    Trace.Write(lead);
                    Trace.Write(":");
                    Trace.Write(user.Length);
                    Trace.Write(":");
                    Trace.WriteLine(user);
                    lead = br.ReadInt32();
                    // weird special case
                    if (lead == 2)
                    {
                        lead = br.ReadInt32();
                    }
                }

                // at the start of the html block
                var htmllen = br.ReadInt32();
                Trace.WriteLine(htmllen);
                // parse the html
                var html = ParseString(br);
                // length prefix read above (the original referenced an undefined 'len')
                Trace.Write(htmllen);
                Trace.Write(":");
                Trace.Write(html.Length);
                Trace.Write(":");
                Trace.WriteLine(html);
                // other structures follow, left unparsed

                return html;
            }
        }
    }

    // a string seems to be ascii encoded and ends with a zero byte.
    private static string ParseString(BinaryReader br)
    {
        var ch = br.ReadByte();
        var sb = new StringBuilder();
        while (ch != 0)
        {
            sb.Append((char)ch);
            ch = br.ReadByte();
        }
        return sb.ToString();
    }
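
As a self-contained check of how many bytes each Read call consumes, here is a minimal sketch that runs against an in-memory stream. The sample bytes, the `ReadSizes` class, and the `ReadLeadingInt` helper are made up for illustration and are not taken from a real .hist file.

```csharp
using System;
using System.IO;

static class ReadSizes
{
    // Reads one little-endian Int32 from the start of the data and reports
    // how far the stream position advanced while doing so.
    public static int ReadLeadingInt(byte[] data, out long bytesConsumed)
    {
        using (var ms = new MemoryStream(data))
        using (var br = new BinaryReader(ms))
        {
            int value = br.ReadInt32();
            bytesConsumed = ms.Position;
            return value;
        }
    }

    static void Main()
    {
        // Hypothetical sample: the integer 5, then "abc", then a zero terminator.
        byte[] sample = { 0x05, 0x00, 0x00, 0x00, 0x61, 0x62, 0x63, 0x00 };

        long consumed;
        int value = ReadLeadingInt(sample, out consumed);
        Console.WriteLine(value);    // 5
        Console.WriteLine(consumed); // 4
    }
}
```

Comparing the position before and after a read against the offsets shown in the hex viewer is a quick way to confirm that a guess about a field's size is right.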

You could use this simple parsing logic in a WinForms application as follows:

    private void button1_Click(object sender, EventArgs e)
    {
        webBrowser1.DocumentText = ParseHist(@"5461EC8C-89E6-40D1-8525-774340083829-Copia.html");
    }

Keep in mind that this is neither bulletproof nor the recommended way, but it should get you started. For files that don't parse well, you'll need to go back to the hex viewer and work out which byte structures are new or different from what you already have. That is not something I intend to help you with; it is left as an exercise for you.

Hi, I'm eager to try your code! I came up with a solution that consists of skipping the first n characters of the file. It seems, though, that only the textBox components are affected by this, displaying only a '-' character. The console output works fine... So, here's what I came up with: http://pastebin.com/gaEZW8ar http://i.imgur.com/X3eFdUQ.jpg I know it's not the best and safest solution, but it's a starting point. — user2340989, Jul 28 '15 at 08:35
Skipping n bytes is guaranteed to fail, as the next file will differ in the names/addresses that seem to be at the start of the file. — rene, Jul 28 '15 at 09:38
A similar type of data, with participants, is repeated at the end of file. The top portion is just a header and CSS markup. So I can still pull info from that file. Reading the file directly into a string, just does not work :/ — user2340989, Jul 28 '15 at 09:56
I was able to successfully parse and extract data with HtmlAgilityPack, as described here: http://stackoverflow.com/questions/19870116/using-htmlagilitypack-for-parsing-a-web-page-information-in-c-sharp — user2340989, Jul 28 '15 at 10:29

user2340989 · Accepted Answer · 2015-07-28T12:02:03.283

I don't know if it's the right way to answer this, but here's what I've managed to do so far:

        string file = @"C:\script_test\{1C0365BC-54C6-4D31-A1C1-586C4575F9EA}.hist";
        string outText = "";
        //Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        Encoding utf8 = Encoding.UTF8;
        string rawText;
        // dispose the reader so the file handle is released
        using (StreamReader reader = new StreamReader(file, utf8))
        {
            rawText = reader.ReadToEnd();
        }
        char[] text = rawText.ToCharArray();
        //skip first n chars
        /*
        for (int i = 250; i < text.Length; i++)
        {
            outText += text[i];
        }
        */
        for (int i = 0; i < text.Length; i++)
        {
            //skips non printable characters
            if (!Char.IsControl(text[i]))
            {
                outText += text[i];
            }
        }
        string source = "";
        source = WebUtility.HtmlDecode(outText);
        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml(source);

        string html = "<html><style>";
        foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//style"))
        {
            html += node.InnerHtml+ Environment.NewLine;
        }
        html += "</style><body>";
        foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//body"))
        {
            html += node.InnerHtml + Environment.NewLine;
        }
        html += "</body></html>";
        richTextBox1.Text += html+Environment.NewLine;

        webBrowser1.DocumentText = html;

The conversation displays correctly, both style and encoding.

So it's a start for me.

Thank you all for the support!

EDIT

Char.IsControl(char)

skips non printable characters :)
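
For reference, the same filter can be written in isolation with a StringBuilder, which avoids repeated string concatenation on large files. This is only a sketch: the `ControlCharFilter` class, the `StripControlChars` name, and the sample input are made up for illustration. Note that Char.IsControl also returns true for carriage returns, line feeds, and tabs, so line breaks in the HTML source are removed as well.

```csharp
using System;
using System.Text;

static class ControlCharFilter
{
    // Returns the input with every control character removed,
    // including NUL, ACK, '\r', '\n' and '\t'.
    public static string StripControlChars(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (!char.IsControl(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Hypothetical input: binary debris followed by markup and a line break.
        string raw = "\u0006\0<b>Hi</b>\r\n";
        Console.WriteLine(StripControlChars(raw)); // <b>Hi</b>
    }
}
```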
