Weird characters in a Microsoft Word document won't export/can't be searched

Question

I have a document that has been sloppily authored. It's a dictionary that contains Cyrillic characters. Most of the dictionary is manageable, but I'm stuck on one thing. Words have accented letters in them, and these are mostly formatted properly as a letter with a Unicode accent (thus forming a single letter). However, there are some very peculiar letters that look like, for example, a;´ (where "a" is any arbitrary Cyrillic letter). You'd expect á in its place. This wouldn't be a problem per se if the thing could be exported to, say, HTML and manipulated in a text editor. The problem is that Word treats this "thing" as a single character/entity, and:

when exporting, it is COMPLETELY omitted;
when copied, it can only be pasted into Notepad (which translates it into three separate characters); when pasted into WordPad, it doesn't appear at all;
when a search is run in Word, it won't find the letter, neither the actual character nor the exactly copied/pasted combination;
the letter disappears when the document is opened in any other software, such as LibreOffice.

At this point I'm trying to:

understand what this combination is exactly
run a search/replace operation to find and weed out all of those errors

Here's a sample Word file.

Here's a screenshot of the word/letter in question:

enter image description here

which when typed correctly should appear like "скре́пка".

Avast! claims that the resource linked to in the question has been infected by URL:Mal. — Jukka K. Korpela, Oct 16 '12 at 21:46
This is not in any sense a programming question, and is therefore off-topic here. Questions about Word automation or VBA scripting are appropriate here; questions about using Word in general are not. The [FAQ](http://stackoverflow.com/faq) has more info on the types of questions that are appropriate here. Voting to close and migrate to [SuperUser](http://superuser.com) where it's more appropriate. — Ken White, Oct 16 '12 at 22:55
Ken, technically it's not, but my final purpose is. I need to have the document broken down into small parts and put into a database by exporting it and running regex search/replace queries. So the intention (at least) is programming. — Захар Joe, Oct 17 '12 at 07:32
Joe: OK. So if I ask a question about what kind of computer I should buy, it's on topic here if someday I plan on using it to write code? Sorry - I don't think so. :-) — Ken White, Oct 17 '12 at 23:25

score 1 · Accepted Answer · answered Oct 16 '12 at 21:16

The 'character' appears to be a Word field of type 'eq' (equation). Here is the field with toggled field codes:

enter image description here

If it is a large document you could try to create a VBA routine that removes the fields and replaces them with corresponding characters.
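
To make the "corresponding characters" part concrete, here is a minimal Python sketch of the mapping such a routine would apply. It is not VBA and does not touch the Word file; it only shows how the text of one field code could be turned into a letter followed by a combining acute accent (U+0301). The pattern assumes every affected field looks like `eq \o(letter;´)` with a spacing acute accent (U+00B4), and the function name `overlay_to_accented` is made up for this illustration.

```python
import re
from typing import Optional

COMBINING_ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT

# Assumed shape of the field code, e.g. "eq \o(е;´)": one letter, a semicolon,
# then a spacing acute accent (U+00B4). Adjust if the document's fields differ.
EQ_OVERLAY = re.compile(r"eq\s*\\o\s*\(\s*(\w)\s*;\s*\u00b4\s*\)", re.IGNORECASE)


def overlay_to_accented(field_code: str) -> Optional[str]:
    """Return letter + combining acute for an overlay field code, else None."""
    match = EQ_OVERLAY.fullmatch(field_code.strip())
    if match is None:
        return None
    return match.group(1) + COMBINING_ACUTE


# Example: the field from the sample word, written with escapes so the exact
# code points are unambiguous (U+0435 is Cyrillic "е", U+00B4 is "´").
fixed = overlay_to_accented("eq \\o(\u0435;\u00b4)")
print("скр" + fixed + "пка")
```

Anything that does not match the assumed pattern returns `None`, so unexpected fields can be listed and checked by hand instead of being silently rewritten.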

Anonimista

Wow, thank you! Such a simple solution that I missed in the context menu. Replacing those won't be a problem now that I have the symbols exposed. – Захар Joe Oct 17 '12 at 07:43

score 0 · Answer 2 · answered Oct 16 '12 at 22:45

Assuming that @Anonimista’s analysis is correct, as I think it is, you could fix the file by running some search and replace operations in Word, replacing e.g. ^19eq \o(е;´)^21 by е́ (the latter is Cyrillic letter е followed by combining acute accent U+0301). This is dull because you would need to do this for each vowel separately (and for uppercase vowels too). But I cannot find a way to use wildcards in this context; the codes ^19 and ^21 for start and end of field work only when wildcards are not enabled.
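
Since the replacement has to be repeated per vowel, the list of find/replace pairs can be generated rather than typed by hand. The following Python sketch only builds the strings to paste into Word's Find and Replace dialog (with wildcards off); it does not automate Word. The vowel list is an assumption (Russian vowels, without ё), and the exact spacing inside the field should be checked against what the document shows with field codes toggled.

```python
# Lowercase Russian vowels; an assumption about the dictionary's language.
VOWELS = "аеиоуыэюя"

ACUTE = "\u00b4"            # spacing acute accent as it appears in the field
COMBINING_ACUTE = "\u0301"  # combining acute accent used in the replacement


def build_replacements(vowels: str = VOWELS) -> list:
    """Return (find, replace) string pairs for lowercase and uppercase vowels."""
    pairs = []
    for vowel in vowels + vowels.upper():
        # ^19 and ^21 are Word's codes for the start and end of a field.
        find = f"^19eq \\o({vowel};{ACUTE})^21"
        pairs.append((find, vowel + COMBINING_ACUTE))
    return pairs


for find, replace in build_replacements():
    print(find, "->", replace)
```

Each printed line corresponds to one manual search and replace run.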

Thanks Jukka. Running a search for every separate vowel is not too much work at all! — Захар Joe, Oct 17 '12 at 07:44
