I read the data from a .docx file using a StreamReader, got the content as a string, and printed it with Console.WriteLine. This content is not the same as the content I get from File.ReadAllBytes for the same file.

The code is shown below.

// first code

StreamReader streamReader = new StreamReader("D:\\sample.docx");
String text = streamReader.ReadToEnd();
Console.WriteLine(streamReader.CurrentEncoding); // it shows the encoding as UTF8
byte[] array = Encoding.UTF8.GetBytes(text);
File.WriteAllBytes("D:\\file3.txt", array);

This is my output when I used the above code

PK     ! ߤ�lZ      [Content_Types].xml �(�                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ���n�0E�����Ub袪*�>�-R�{V��Ǽ��QU�
l"%3��3Vƃ�ښl    �w%�=���^i7+���-d&�0�A�6�l4��L60#�Ò�S
O����X� �*��V$z�3��3������%p)O�^����5}nH"d�s�Xg�L�`���|�ԟ�|�P�rۃs�?�PW��tt4Q+��"�wa���|T\y���,N���U�%���-D/��ܚ��X�ݞ�(���<E��)�� ;�N�L?�F�˼��܉��<Fk� �h�y����ڜ���q�i��?�ޯl��i� 1��]�H�g��m�@����m�  �� PK     ! ���   N   _rels/.rels �(�                    

// second code

byte[] x = File.ReadAllBytes("D:\\sample.docx");
File.WriteAllBytes("C:\\file3.txt", x);

The contents of the two files are different. Is there any way to make my first code produce the same content as the second?

This is my output when I used ReadAllBytes

PK     ! ߤÒlZ      [Content_Types].xml ¢(                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ´”ËnÂ0E÷•ú‘·Ub袪*‹>–-Ré{Vý’Ǽþ¾QU‘
l"%3÷Þ3VƃÑÚšl  µw%ë=–“^i7+Ù×ä-d&á”0ÞAÉ6€l4¼½L60#µÃ’ÍS
Oœ£œƒXø Ž*•V$z3„ü3à÷½Þ—Þ%p)Oµ^ “²×5}nH"dÙsÓXg•L„`´‰ê|éÔŸ”|—PrÛƒsðŽ?˜PWŽìtt4Q+ÈÆ"¦wa©‹¯|T\y¹°¤,NÛàôU¥%´úÚ-D/‘ÎÜš¢­X¡Ýžÿ(¦¼<EãÛ)‘à ;çN„L?¯Fñ˼¤¢Ü‰˜¸<FkÝ  ‘h¡yöÏæØÚœŠ¤Îqôi£ã?ÆÞ¯l­Îià 1éÓ]›HÖgÏõm @ÈæÛûmø  ÿÿ PK     ! ‘·ï   N   _rels/.rels ¢(          
wazza
  • What do you mean by the content being different? Are you talking about fonts, etc.? – user1666620 Jan 05 '15 at 12:06
  • It's not a text file. If you want to read a Word document, use Interop or one of the libraries made for that. – Selman Genç Jan 05 '15 at 12:08
  • When I read it as bytes, I get binary data, but when I read it as a string, some of the data shows up as ??? – wazza Jan 05 '15 at 12:09
  • @Selman22 Thank you, but is there any way to get the same content? – wazza Jan 05 '15 at 12:10
  • What are you trying to do? Why are you not just using File.Copy()? Also, try googling for "encoding". – helb Jan 05 '15 at 12:14
  • @helb I have a built-in function that converts .docx content to .txt content and takes bytes as its parameter. I also have a built-in function that reads the contents as a string. Now I want to convert that string to the correct bytes so that I can pass them to that function. – wazza Jan 05 '15 at 12:19
  • [`File.ReadAllBytes()`](http://msdn.microsoft.com/en-us/library/system.io.file.readallbytes). – Uwe Keim Jan 05 '15 at 12:22
  • I suggest you look into "interop" as suggested above. You can read and write Word documents from C# using special libraries that 'know' the internal format of the document. See for example: http://www.techrepublic.com/blog/how-do-i/how-do-i-modify-word-documents-using-c/ Also, there are tools that don't require Word to be installed. – NoChance Jan 05 '15 at 12:27
  • The difference between the two sections of code is fundamental. In the first one you are using a StreamReader, which by design converts the data it reads to a C# string. Unfortunately for you, the data in a docx file is not string data, so you are converting binary data to a string and later trying to convert it back to bytes (i.e. binary data). If you wish to read the data into a stream and write it out again, use BinaryReader instead. – NotJarvis Jan 05 '15 at 12:27
  • @NotJarvis As I mentioned in the comment above, I have a built-in function that uses a StreamReader and also a function that converts docx to txt content and takes bytes as its parameter. I am converting the string to bytes so that I can use that function. Is there any way to achieve that? – wazza Jan 05 '15 at 12:33
  • `StreamReader` is meant for text files, not binary files, so you shouldn't be using it in the first place. Check out these threads: http://stackoverflow.com/questions/6491305/streamreader-and-binary-data http://stackoverflow.com/questions/10353913/streamreader-vs-binaryreader – Andrew Jan 05 '15 at 12:56
  • A docx is a zipped binary file. You should either use Office libraries or perhaps use [System.IO.Packaging](http://msdn.microsoft.com/en-us/library/system.io.packaging.package%28v=vs.110%29.aspx) to access the content files inside. – crashmstr Jan 05 '15 at 13:39
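
The point made in the comments above (decoding arbitrary bytes as UTF-8 is lossy) can be illustrated with a small sketch. The sample bytes and the `RoundTrips` helper are hypothetical and only for illustration. The sketch assumes that, if a string really must sit between the two functions, a single-byte encoding such as ISO-8859-1 or a Base64 string could carry the bytes unchanged; reading with File.ReadAllBytes remains the simpler option.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper: decode bytes to a string with the given encoding,
    // encode the string back, and report whether the bytes survived unchanged.
    public static bool RoundTrips(byte[] data, Encoding encoding)
    {
        string text = encoding.GetString(data);
        byte[] back = encoding.GetBytes(text);
        return data.SequenceEqual(back);
    }

    public static void Main()
    {
        // Made-up sample: the "PK" ZIP signature followed by bytes that are
        // not valid UTF-8 (0xD2 is not followed by a continuation byte, and
        // 0xFF never appears in UTF-8).
        byte[] sample = { 0x50, 0x4B, 0x03, 0x04, 0xD2, 0x6C, 0xFF };

        // UTF-8 replaces invalid sequences with U+FFFD, so data is lost.
        Console.WriteLine(RoundTrips(sample, Encoding.UTF8)); // False

        // ISO-8859-1 maps every byte value to exactly one character.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        Console.WriteLine(RoundTrips(sample, latin1)); // True

        // Base64 is another lossless way to carry bytes inside a string.
        string base64 = Convert.ToBase64String(sample);
        Console.WriteLine(sample.SequenceEqual(Convert.FromBase64String(base64))); // True
    }
}
```

With a file, the same idea would mean passing the single-byte encoding to both sides, for example `new StreamReader(path, latin1, false)` when reading and `latin1.GetBytes(text)` when converting back. The resulting string is still not readable text, only a container for the bytes.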

1 Answer

To read text from a Word document, use Microsoft Word interop (this requires Word to be installed on the machine).

The example below shows how to read the text of each paragraph.

Add a reference to Microsoft.Office.Interop.Word.

using System;
using System.Collections.Generic;
using Microsoft.Office.Interop.Word;

Application application = new Application();

// Open a doc file.
Document document = application.Documents.Open("D:\\Test.docx");

List<string> data = new List<string>();

// Word collections are 1-based, so iterate from 1 to Count.
for (int i = 1; i <= document.Paragraphs.Count; i++)
{
    // Collect the text of each non-empty paragraph.
    string temp = document.Paragraphs[i].Range.Text.Trim();
    if (temp != string.Empty)
        data.Add(temp);
}

foreach (var item in data)
{
    Console.WriteLine(item);
}

// Close word.
application.Quit();
Gun
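
As one of the comments notes, a docx is a zipped package, so the text can also be read without Word installed. The following is a minimal sketch under stated assumptions: the class and method names (`DocxText`, `ReadParagraphs`) are hypothetical, the main body is assumed to live in the `word/document.xml` part, and only plain `w:t` text runs are collected (tables, headers, footnotes, and nested text boxes are not handled specially). On .NET Framework this needs a reference to System.IO.Compression.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxText
{
    // WordprocessingML main namespace used by elements in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns the text of each non-empty paragraph.
    public static string[] ReadParagraphs(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");

            using (Stream xml = entry.Open())
            {
                XDocument doc = XDocument.Load(xml);

                // Join the text runs (w:t) inside each paragraph (w:p).
                return doc.Descendants(W + "p")
                    .Select(p => string.Concat(
                        p.Descendants(W + "t").Select(t => t.Value)).Trim())
                    .Where(s => s.Length > 0)
                    .ToArray();
            }
        }
    }

    public static void Main(string[] args)
    {
        // Usage: pass the path of a .docx file as the first argument.
        if (args.Length == 0)
        {
            Console.WriteLine("Usage: DocxText <path-to-docx>");
            return;
        }

        using (FileStream file = File.OpenRead(args[0]))
        {
            foreach (string paragraph in ReadParagraphs(file))
                Console.WriteLine(paragraph);
        }
    }
}
```

Because the helper takes a Stream, bytes obtained from File.ReadAllBytes can be passed in by wrapping them in a `new MemoryStream(bytes)`.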