I am trying to parse a DOCX document in C#. When I read the content of the DOCX as XML, the text is split into separate runs according to a language that is assigned automatically based on the content.
For example, the Word document contains the text $ 100,000.00.
When I convert this to XML format, the output is:
<w:r>
  <w:rPr>
    <w:rFonts w:hint="default" w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:sz w:val="22"/>
    <w:szCs w:val="28"/>
    <w:lang w:val="en-US"/>
  </w:rPr>
  <w:t xml:space="preserve">$ </w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:rFonts w:hint="default" w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:sz w:val="22"/>
    <w:szCs w:val="28"/>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t>100,000.00.</w:t>
</w:r>
If the text contains other languages, it is split in the same way, for example into es-ES and en-US runs.
I am using C# to replace words in the DOCX file. For the replacement I use the WordprocessingDocument class from the Open XML SDK. As far as I understand, it opens the document and exposes its content in Open XML format, which is why I get the XML above when I print out the content of the DOCX file. This XML splits the text into separate runs by language. Therefore, if I want to replace $ 100,000.00 in the document with some other value, the text is not found, because in the XML it is not the single string $ 100,000.00 but two separate pieces, "$ " and "100,000.00.", as you can see from the XML.
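To make the goal concrete, here is a minimal sketch of what I mean by reading the text without the run boundaries. It uses only System.Xml.Linq rather than the Open XML SDK. The RunTextDemo class, the JoinRunText helper and the enclosing w:p element are made up for this illustration and are not part of my actual code.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RunTextDemo
{
    // Main WordprocessingML namespace that the w: prefix is bound to.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: concatenates the text of all w:t elements below
    // the given element, ignoring run boundaries and run properties (w:lang).
    public static string JoinRunText(XElement container)
    {
        return string.Concat(container.Descendants(W + "t").Select(t => t.Value));
    }

    public static void Main()
    {
        // The two runs from the XML above, wrapped in an assumed w:p paragraph.
        string xml =
            "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
            "<w:r><w:rPr><w:lang w:val=\"en-US\"/></w:rPr>" +
            "<w:t xml:space=\"preserve\">$ </w:t></w:r>" +
            "<w:r><w:rPr><w:lang w:val=\"en-GB\"/></w:rPr>" +
            "<w:t>100,000.00.</w:t></w:r>" +
            "</w:p>";

        XElement paragraph = XElement.Parse(xml, LoadOptions.PreserveWhitespace);

        // Prints the paragraph text as one string: $ 100,000.00.
        Console.WriteLine(JoinRunText(paragraph));
    }
}
```

This only reads the joined text. Writing a replacement back would still have to decide which run receives the new text, which is the part I am unsure about.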
The JSON entries that are read and used for the replacement (each "Animal" value is replaced with the corresponding "Human" value):
[
  {
    "Animal": "Frog",
    "Human": "John"
  },
  {
    "Animal": "Horse",
    "Human": "Alice"
  }
]
The C# code that replaces the words in the DOCX file:
using System;
using System.IO;
using DocumentFormat.OpenXml.Packaging;
using Newtonsoft.Json.Linq;

private void ReplaceWords(Stream stream, JArray jsonStream, string filePath)
{
    string docText;
    using (WordprocessingDocument doc = WordprocessingDocument.Open(stream, true))
    {
        // Read the raw XML of the main document part.
        using (StreamReader docxReader = new StreamReader(doc.MainDocumentPart.GetStream()))
        {
            docText = docxReader.ReadToEnd();
        }

        // Apply every replacement to the same string so that earlier
        // replacements are not overwritten by later iterations.
        foreach (JObject item in jsonStream)
        {
            string animal = item.GetValue("Animal").ToString();
            string human = item.GetValue("Human").ToString();
            Console.WriteLine(animal);
            Console.WriteLine(human);
            docText = docText.Replace(animal, human);
        }

        Console.WriteLine(docText);

        // Overwrite the main document part with the modified XML.
        using (StreamWriter docxWriter = new StreamWriter(doc.MainDocumentPart.GetStream(FileMode.Create)))
        {
            docxWriter.Write(docText);
        }
    }
}
However, I want to be able to find $ 100,000.00 (or any other text) as one unbroken string.
Is there any way to ignore the language settings and get the content of the DOCX as a plain string that contains all of its text?