How to format the XML content of a DOCX file so that it is not split by language?

Question

I am trying to parse a DOCX document in C#. When I read the content of the DOCX file as XML, the text is split up according to languages that are assigned automatically based on the content.

For example, in the Word document I have a sentence like $ 100,000.00.

When I convert this to XML format, the output is:

        <w:r>
            <w:rPr>
                <w:rFonts w:hint="default" w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
                <w:sz w:val="22"/>
                <w:szCs w:val="28"/>
                <w:lang w:val="en-US"/>
            </w:rPr>
            <w:t xml:space="preserve">$ </w:t>
        </w:r>
        <w:r>
            <w:rPr>
                <w:rFonts w:hint="default" w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
                <w:sz w:val="22"/>
                <w:szCs w:val="28"/>
                <w:lang w:val="en-GB"/>
            </w:rPr>
            <w:t>100,000.00.</w:t>
        </w:r>

If other languages are present, the text is split for those as well, e.g. es-ES, en-US, etc.

I am using C# to replace words in the DOCX file. For the replacement I use the WordprocessingDocument class. As far as I understand, it parses the document and exposes it as a stream in Open XML format, which is why I get the XML above when I print out the content of the DOCX file. This XML separates the content by language. Therefore, if I want to replace $ 100,000.00 in the document with some other value, the text is not found, because the XML does not contain $ 100,000.00. as one string but $ and 100,000.00. separately, as you can see above.

The JSON entries to be read and used for the replacement:

[
  {
    "Animal": "Frog",
    "Human": "John"
  },
  {
    "Animal": "Horse",
    "Human": "Alice"
  }
]

The C# code that replaces the words in the DOCX file:

using System;
using System.IO;
using DocumentFormat.OpenXml.Packaging;
using Newtonsoft.Json.Linq;

private void ReplaceWords(Stream stream, JArray jsonStream, string filePath)
{
    string docText;
    using (WordprocessingDocument doc = WordprocessingDocument.Open(stream, true))
    {
        // Read the raw XML of the main document part into a string.
        using (StreamReader docxReader = new StreamReader(doc.MainDocumentPart.GetStream()))
        {
            docText = docxReader.ReadToEnd();
        }

        foreach (JObject item in jsonStream)
        {
            string animals = item.GetValue("Animal").ToString();
            string humans = item.GetValue("Human").ToString();
            Console.WriteLine(animals);
            Console.WriteLine(humans);
            // Replace on the accumulated text; starting from the original
            // string each time would keep only the last replacement.
            docText = docText.Replace(animals, humans);
        }

        Console.WriteLine(docText);
        // Overwrite the main document part with the modified XML.
        using (StreamWriter docxWriter = new StreamWriter(doc.MainDocumentPart.GetStream(FileMode.Create)))
        {
            docxWriter.Write(docText);
        }
    }
}

However, I want to get $ 100,000.00 (or any other text) as one piece, without any separation.

Is there any way to ignore the language settings and get the content as a plain string that contains all the text of the DOCX file?

Hmm... I recommend renaming .docx to .zip and looking at the files inside with a text editor... This may help you to understand what is actually there (as you seem to expect WordprocessingDocument to actually change the content). Side note: using string manipulation to edit XML is a bad idea, make sure to read [this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/) before you decide that regex is what you actually need. — Alexei Levenkov, Jan 19 '20 at 21:17
To clear up some potential confusion: `WordprocessingDocument` is not creating an XML document from some other format. DOCX _is_ a (zipped) XML document. The library you are using is simply uncompressing and opening this XML file. Breaking up text that seems contiguous in the Word GUI into several XML elements is an expected peculiarity of Office Open XML. — Mathias Müller, Jan 19 '20 at 21:19
Renaming your DOCX file to .zip is a good first step (and then poking around inside the zip file to see all the Open XML goodness). A good second step would be to download the *OpenXML Productivity Tool* from Microsoft's site and use it to poke around. With it, you can see the structure of the document, the XML inside, and the code you'd need to recreate the document. — Flydog57, Jan 20 '20 at 01:58
In a nutshell, there's no way to get Word to not do this. Language is like any other formatting: it's applied to *runs*. Unless the user profile has been very carefully configured, there's no way to avoid the situation you describe. Your code would need to extract *only* the `w:t` elements to get the content, and strip out all the `w:rPr`, or retain only one set of it, in order to find/replace the way you describe. Note that this problem also occurs with internal edit tracking (rsId stuff), as well as all kinds of formatting... — Cindy Meister, Jan 20 '20 at 06:17
@Sojimanatsu Note that this "content" separating, i.e. your text being broken into multiple `w:t` elements, is a relatively common thing in DOCX. As an example, this will also occur when your text has different formatting, for instance when the "100,000" part has a larger font size than the ".00" part. Anyway, I recommend reading [Search and Replace Text in an Open XML WordprocessingML Document](http://www.ericwhite.com/blog/search-and-replace-text-in-an-open-xml-wordprocessingml-document/). — Mario Z, Jan 20 '20 at 07:24
Look at my answer to [this other question](https://stackoverflow.com/questions/59303646/open-xml-find-and-replace-multiple-placeholders-in-document-template/59328568#59328568). As others have noted already, the Open XML SDK only provides you with the exact Open XML markup created, for example, by Microsoft Word. If you have a mix of languages (e.g., en-US, en-GB) in one paragraph (`w:p`), the text will be split into runs (`w:r`). If that gets in your way, you'll have to simplify the Open XML markup by coalescing (or merging) those runs before you do the actual processing. — Thomas Barnekow, Jan 20 '20 at 08:25
Don't parse it by yourself. Use a library like [DOCX](https://github.com/xceedsoftware/DocX) — Thomas Weller, Jan 20 '20 at 09:58
I also used DocX @ThomasWeller but somehow, after the replacement process, it didn't save the file. I tried both Save and SaveAs functions. But it kept downloading the initial document, without applying the changes. I was printing out the text, perfectly replaced, however saving was a problem. — Sojimanatsu, Jan 20 '20 at 17:32
That sounds more like a caching problem than a DocX problem — Thomas Weller, Jan 20 '20 at 18:16
@Sojimanatsu I wouldn't recommend using the DocX library. In the past, I have often seen it generate an invalid document. For example, it can write a `w:r` element inside another `w:r` element, which is not correct according to the OOXML specification. In most cases, MS Word won't have a problem opening such a file because it has great error-repair mechanisms, but other word processors like LibreOffice Writer won't be able to open your document. — Mario Z, Jan 21 '20 at 05:16
@MarioZ Thanks for the information. But XML is also kinda complicated to process. I chose C# and the .NET environment for these modifications because I thought Microsoft would have some native, much easier built-in libraries to modify DOCX documents. Is manipulating the XML really the only way to parse and modify DOCX documents in C# (except paid software)? — Sojimanatsu, Jan 21 '20 at 08:23
@Sojimanatsu unfortunately, there is no native solution. The OpenXML SDK is the closest to native that you can get, but as you already noticed, it's a bit tedious to use because of its low-level API, and it requires knowing the DOCX specification. — Mario Z, Jan 21 '20 at 08:57
@Sojimanatsu if you have a simple Find & Replace requirement then you could use [my solution](https://www.codeproject.com/Articles/1106636/Find-and-Replace-text-in-a-Word-document). But note that it does have a streaming issue ([see workaround here](https://www.codeproject.com/Articles/1106636/Find-and-Replace-text-in-a-Word-document?msg=5407745#xx5407745xx)) and some limitations, for instance, the new line and tab characters are ignored because they require the use of some special XML elements. Hopefully, I'll get the chance to improve that solution in the future. — Mario Z, Jan 21 '20 at 09:01
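
The suggestions in the comments above (read only the `w:t` text of a paragraph, and simplify the runs before replacing) could be sketched roughly as follows. This is a minimal illustration, not the algorithm from any of the linked articles: it works on the raw XML string with `System.Xml.Linq` instead of the Open XML SDK's typed classes, the class and method names (`RunTextReplacer`, `ReplaceAcrossRuns`) are hypothetical, and it assumes that losing the per-run formatting of the replaced text is acceptable.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class RunTextReplacer
{
    // Namespace of WordprocessingML elements such as w:p, w:r and w:t.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // For every paragraph whose concatenated w:t text contains `search`,
    // writes the replaced text into the first w:t and empties the others.
    // Assumption: it is fine for the whole paragraph text to take on the
    // formatting (language, font, ...) of the paragraph's first run.
    public static string ReplaceAcrossRuns(string documentXml, string search, string replacement)
    {
        if (string.IsNullOrEmpty(search))
            throw new ArgumentException("Search text must not be empty.", nameof(search));

        XDocument xml = XDocument.Parse(documentXml, LoadOptions.PreserveWhitespace);

        foreach (XElement paragraph in xml.Descendants(W + "p"))
        {
            var texts = paragraph.Descendants(W + "t").ToList();
            if (texts.Count == 0) continue;

            // The text as a reader sees it, regardless of how it is split into runs.
            string combined = string.Concat(texts.Select(t => t.Value));
            if (!combined.Contains(search)) continue;

            texts[0].Value = combined.Replace(search, replacement);
            // Keep leading/trailing spaces of the merged text significant.
            texts[0].SetAttributeValue(XNamespace.Xml + "space", "preserve");
            foreach (XElement t in texts.Skip(1)) t.Value = string.Empty;
        }

        // Note: XDocument.ToString() does not emit the XML declaration.
        return xml.ToString(SaveOptions.DisableFormatting);
    }
}
```

With the XML from the question wrapped in a `w:p`, calling `RunTextReplacer.ReplaceAcrossRuns(docText, "$ 100,000.00.", "some other value")` would find the text even though it is spread over an en-US and an en-GB run. The sketch ignores tabs, line breaks, and paragraphs nested inside other paragraphs (for example text boxes), so treat it as a starting point rather than a complete find-and-replace.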
