
I have a Word document that has been sloppily authored. It's a dictionary that contains Cyrillic characters. Most of the dictionary is manageable, but I'm stuck on one thing. Words have accented letters in them, and most are formatted properly as a letter followed by a Unicode combining accent (thus forming a single letter). However, some very peculiar letters look like, for example, a;´ (where "a" is any arbitrary Cyrillic letter), where you'd expect á. This wouldn't be a problem per se if the thing could be exported to, say, HTML and manipulated in a text editor. The problem is that Word treats this "thing" as a single character/entity and

  • when exporting, it is COMPLETELY omitted
  • when copied, it can only be pasted into Notepad (which turns it into three separate characters); when pasted into WordPad, it doesn't appear at all
  • when a search is run in Word, it won't find the letter, whether I search for the actual character or for the exactly copied/pasted combination
  • the letter disappears when the document is opened in any other software, such as LibreOffice

At this point I'm trying to:

  • understand what this combination is exactly
  • run a search/replace operation to find and weed out all of those errors

Here's a sample Word file.

Here's a screenshot of the word/letter in question:

[screenshot]

Typed correctly, it should appear as "скре́пка".

Beth
Захар Joe
  • Avast! claims that the resource linked to in the question has been infected by URL:Mal. – Jukka K. Korpela Oct 16 '12 at 21:46
  • This is not in any sense a programming question, and is therefore off-topic here. Questions about Word automation or VBA scripting are appropriate here; questions about using Word in general are not. The [FAQ](http://stackoverflow.com/faq) has more info on the types of questions that are appropriate here. Voting to close and migrate to [SuperUser](http://superuser.com) where it's more appropriate. – Ken White Oct 16 '12 at 22:55
  • Ken, technically it's not, but my final purpose is. I need to break the document down into small parts and put them into a database by exporting it and running regex search/replace queries. So the intention (at least) is programming. – Захар Joe Oct 17 '12 at 07:32
  • Joe: OK. So if I ask a question about what kind of computer I should buy, it's on topic here if someday I plan on using it to write code? Sorry - I don't think so. :-) – Ken White Oct 17 '12 at 23:25

2 Answers


The 'character' appears to be a Word field of type 'eq' (equation). Here is the field with toggled field codes:

[screenshot of the field with field codes shown]

If it is a large document you could try to create a VBA routine that removes the fields and replaces them with corresponding characters.

Anonimista
  • Wow, thank you! Such a simple solution that I missed in the context menu. Replacing those won't be a problem now that I have the symbols exposed. – Захар Joe Oct 17 '12 at 07:43

Assuming that @Anonimista’s analysis is correct, as I think it is, you could fix the file by running search-and-replace operations in Word, replacing e.g. ^19eq \o(е;´)^21 with е́ (Cyrillic letter е followed by the combining acute accent, U+0301). This is tedious because you would need to do it for each vowel separately (and for uppercase vowels too). I could not find a way to use wildcards in this context: the codes ^19 and ^21, for field start and field end, work only when wildcards are not enabled.
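If the text can be taken outside Word, a regular expression can handle every letter in one pass. Below is a minimal Python sketch of that idea, not a tested fix for the sample file. It assumes (and this is only an assumption) that you have the document text as a string in which the field start and end survive as the control characters 0x13 and 0x15, i.e. the characters that ^19 and ^21 stand for, and that each field has no result text. The names `EQ_OVERSTRIKE` and `fix_accents` are made up for illustration.

```python
import re

FIELD_BEGIN = "\x13"        # character code 19, written ^19 in Word's Find box
FIELD_END = "\x15"          # character code 21, written ^21 in Word's Find box
COMBINING_ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT

# Assumed field shape: <begin>eq \o(<one Cyrillic letter>;<acute mark>)<end>
# The separator may be ';' or ',' and the mark may be U+00B4, U+0301 or "'".
EQ_OVERSTRIKE = re.compile(
    FIELD_BEGIN
    + r"\s*eq\s+\\o\s*\(\s*([\u0400-\u04FF])\s*[;,]\s*[\u00B4\u0301']\s*\)\s*"
    + FIELD_END,
    re.IGNORECASE,
)


def fix_accents(text: str) -> str:
    """Replace each overstrike field with the letter plus a combining acute."""
    return EQ_OVERSTRIKE.sub(lambda m: m.group(1) + COMBINING_ACUTE, text)


if __name__ == "__main__":
    broken = "скр" + FIELD_BEGIN + "eq \\o(е;´)" + FIELD_END + "пка"
    print(fix_accents(broken))
```

The character class `[\u0400-\u04FF]` covers the basic Cyrillic block, so lowercase and uppercase vowels are handled by the same pattern. Text outside the fields is left untouched.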

Jukka K. Korpela