
I am a beginner at Python and have run into a coding problem that I can't solve.

What I have:

the source sentences and their respective translations, in two columns of a spreadsheet;

the HTML code, which contains the sentences and HTML tags.

What I'm trying to do: use Python's regex method sub() to find the English sentences and replace them with their respective translations.

For example, the HTML code contains three sentences: Pumas are large animals. They are found in America. They don't eat grass.

I have the translations of each sentence in the html code. I want to replace the sentences one at a time and also keep the html tags. Normally I can use the sub() method like this:

regex1 = re.compile(r'(\>.*)SOURCE_SENTENCE_HERE ?(.*\<)')

resultCode = regex1.sub(r'\1TRANSLATION_SENTENCE_HERE\2', originalHtmlCode)

I've written a Python script to do this. I save the HTML code in a txt file and read it in my Python code (succeeded). Then I create a dictionary to store the source-target pairs from the spreadsheet mentioned above (succeeded). Lastly, I use the regex sub() method to find and replace the sentences in the HTML code (failed). This last part doesn't work at all for some reason. Link to my Python code - https://pastebin.com/ZSUNB4yg or below:

import re, openpyxl, pyperclip

buynavFile = open('C:\\Users\\zs\\Documents\\PythonScripts\\buynavCode.txt')
buynavCode = buynavFile.read()
buynavFile.close()

wb = openpyxl.load_workbook('buynavSegments.xlsx')              
sheet = wb.get_sheet_by_name('Sheet1')                          
segDict = {}
maxRow = sheet.max_row
for i in range(2, maxRow + 1):
    segDict[sheet.cell(row=i, column=3).value] = sheet.cell(row=i, column=4).value

for k, v in segDict.items():                            
    k = '(\\>.*)' + str(k) + ' ?(.*\\<)'                
    v = '\\1' + str(v) + '\\2'                          
    buynavRegex = re.compile(k)
    buynavResult = buynavRegex.sub(v, buynavCode)

pyperclip.copy(buynavResult)                            
print('Result copied to clipboard')

Error message below:

Traceback (most recent call last):
  File "C:\Users\zs\Documents\PythonScripts\buynav.py", line 20, in <module>
    buynavResult = buynavRegex.sub(v, buynavCode)
  File "C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\re.py", line 326, in _subx
    template = _compile_repl(template, pattern)
  File "C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\re.py", line 317, in _compile_repl
    return sre_parse.parse_template(repl, pattern)
  File "C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\sre_parse.py", line 943, in parse_template
    addgroup(int(this[1:]), len(this) - 1)
  File "C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\sre_parse.py", line 887, in addgroup
    raise s.error("invalid group reference %d" % index, pos)
sre_constants.error: invalid group reference 11 at position 1

Could someone enlighten me on this please? I would really appreciate it.

wbzy00
  • Note that `buynavResult` is getting overwritten, not appended to, in each loop. Also, I would consider traversing the DOM with something like lxml rather than using regular expressions. – FiddleStix Sep 03 '19 at 09:57
  • Thank you so much! – wbzy00 Sep 03 '19 at 14:43

1 Answer

Suppose you want a replacement text made of the contents of group 1 followed by the literal string 2. You could write r'\12', but this won't work, because the regex parser reads it as a reference to group 12 instead of group 1 followed by the string 2!

>>> re.sub(r'(he)llo', r'\12', 'hello')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python3.6/re.py", line 326, in _subx
    template = _compile_repl(template, pattern)
  File "/usr/lib/python3.6/re.py", line 317, in _compile_repl
    return sre_parse.parse_template(repl, pattern)
  File "/usr/lib/python3.6/sre_parse.py", line 943, in parse_template
    addgroup(int(this[1:]), len(this) - 1)
  File "/usr/lib/python3.6/sre_parse.py", line 887, in addgroup
    raise s.error("invalid group reference %d" % index, pos)
sre_constants.error: invalid group reference 12 at position 1

You can solve this using the \g<1> syntax to refer to the group: r'\g<1>2':

>>> re.sub(r'(he)llo', r'\g<1>2', 'hello')
'he2'

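In your case, the replacement string contains dynamic content, str(v), which can be anything. If it happens to start with a digit, you end up in the case described above: '\\1' followed by a translation that starts with 1 is read as a reference to group 11, which is exactly the error in your traceback. Use \g<1> (and \g<2>) instead of \1 and \2 to avoid this issue.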
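Applied to the loop in the question, this could look roughly like the sketch below. It is not a drop-in replacement for your script: the function name translate_html and the sample data are made up for illustration, and the spreadsheet and clipboard steps are left out. Besides switching to \g<1> and \g<2>, it escapes the source sentence with re.escape, doubles any backslashes in the translation so it is taken literally, and feeds each result into the next substitution so that earlier replacements are not lost (the point made in the comment under the question).

```python
import re


def translate_html(html, pairs):
    """Replace each source sentence in html with its translation.

    Hypothetical helper for illustration; pairs maps source -> translation.
    """
    result = html
    for source, target in pairs.items():
        # re.escape stops '.', '?' and similar characters in the sentence
        # from being interpreted as regex syntax.
        pattern = re.compile(r'(>.*)' + re.escape(str(source)) + r' ?(.*<)')
        # Double backslashes so the translation is inserted literally, and use
        # \g<1> / \g<2> so a leading digit cannot extend the group number.
        literal = str(target).replace('\\', '\\\\')
        replacement = r'\g<1>' + literal + r'\g<2>'
        # Work on the previous result so earlier replacements are kept.
        result = pattern.sub(replacement, result)
    return result


# Made-up sample data; both translations start with a digit on purpose.
html = '<p>Pumas are large animals.</p>\n<p>They are found in America.</p>'
pairs = {
    'Pumas are large animals.': '1. Pumas (translated)',
    'They are found in America.': '2. America (translated)',
}
print(translate_html(html, pairs))
```

With the sample data, both paragraphs should come back with the numbered translations in place and the <p> tags untouched. Note that the ' ?' in the pattern, kept from the question, also consumes one space after the source sentence when there is one.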

Giacomo Alzetta