
I want to replace text in my Word document. I can already replace text strings that match completely, but I also want to replace text that matches the search string by about 90%.

I am using python-docx for working with Word documents.
The code below replaces text in my Word document only if it matches completely.

def docx_replace_regex(doc_obj, regex, replace):
    # doc_obj is a Document or a table cell; regex is a compiled pattern.
    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            inline = p.runs
            # Loop over runs (spans of text with the same style)
            for i in range(len(inline)):
                if regex.search(inline[i].text):
                    text = regex.sub(replace, inline[i].text)
                    inline[i].text = text

    # Recurse into table cells, which have their own paragraphs and tables.
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex, replace)

I can't find a proper way to replace a partially matched string.
Any kind of help is much appreciated.
Thanks in advance.

Purva
  • I don't think this has to do with `python-docx` per se, or at least it doesn't have to. `python-docx` can give you a `str` object, which you can then modify as you please and "write back" to `python-docx`. Your problem reduces to how to do a fuzzy search/replace on a `str` object, which you should be able to find more about on search, like this top Google hit: https://github.com/seatgeek/fuzzywuzzy – scanny Sep 25 '18 at 18:20
  • @scanny Actually, I am working on machine translation. I want to replace the English text in my **Word document** with the translated text, line by line, by searching for and replacing text strings. I use python-docx for this and work on each `run` to keep the text formatting. With the code above I can replace text when a string matches exactly, but I also want to replace it when it matches around 90%. – Purva Sep 26 '18 at 06:23

1 Answer


I don't think filtering regular-expression matches gives the right results, because the re module only returns non-overlapping matches: if you filter out some matches, a less-than-90% match that overlaps a 90%+ match will prevent the 90%+ match from being recognized.

I also considered difflib, but that will give you the first match, not the best match.
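For the simpler case where a whole paragraph or run is compared against the source string (rather than locating a substring inside it), difflib can still act as a scorer. The following is only a sketch under that assumption: `similarity` and `replace_if_similar` are hypothetical helper names, and the 0.9 threshold comes from the question.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def replace_if_similar(text, source, translation, threshold=0.9):
    """Return translation when text is at least threshold-similar to source.

    Hypothetical helper: it compares the whole string, not substrings.
    """
    if similarity(text, source) >= threshold:
        return translation
    return text
```

The returned string could then be written back to a run's text, as in the question's code. This does not help when the near match is only part of a longer string.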

I think you'll have to write it from scratch.

Something like:

def find_fuzzy_match(match_string, text):
    # Yields (start, end) index pairs into text; end is inclusive.
    # Use an iterator so that we can skip to the end of a match.
    text_iter = enumerate(text)
    for index, char in text_iter:
        try:
            # Align the current character with its first occurrence in match_string.
            match_start = match_string.index(char)
        except ValueError:
            continue
        match_count = 0
        zip_char = zip(match_string[match_start:], text[index:])
        for match_index, (match_char, text_char) in enumerate(zip_char):
            if match_char == text_char:
                match_count += 1
                last_match = match_index
        # Accept if at least 90% of match_string's characters lined up.
        if match_count >= len(match_string) * 0.9:
            yield index, index + last_match
            # Advance the iterator past the match
            for x in range(last_match):
                next(text_iter)
Aaron Bentley
  • Actually, I am working on machine translation. I want to replace the English text in my Word document with the translated text, line by line, by searching for and replacing text strings. I use python-docx for this and work on each run to keep the text formatting. With the code in my question I can replace text when a string matches exactly, but I also want to replace it when it matches around 90%. – Purva Sep 26 '18 at 12:07
  • My find_fuzzy_match will find matches. It should be easy to replace them once you've found them. – Aaron Bentley Sep 26 '18 at 14:10
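
One way the replacement step might look, as a minimal sketch: it assumes the spans are `(start, end)` pairs with an inclusive `end`, sorted and non-overlapping, which is what `find_fuzzy_match` appears to yield. `replace_spans` is a hypothetical helper name, not part of python-docx.

```python
def replace_spans(text, spans, replacement):
    """Return text with every (start, end) span replaced; end is inclusive.

    Assumes spans are sorted by start and do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end in spans:
        pieces.append(text[cursor:start])  # unchanged text before the span
        pieces.append(replacement)
        cursor = end + 1                   # skip past the inclusive end
    pieces.append(text[cursor:])           # unchanged tail
    return "".join(pieces)
```

Under those assumptions, something like `replace_spans(run.text, find_fuzzy_match(source, run.text), translation)` could produce the new text for a run.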