0

I have been running a Pywikibot bot on Marathi Wikipedia for almost a month now. The bot's only task is find and replace. General details about Pywikibot are at: pywikibot. The details of this particular find-and-replace operation are at replace.py and fixes.py, and there are further examples of fixes here.

The following is part of my source code. When running the bot on Marathi Wikipedia, I run into a difficulty caused by the Marathi script. All of the replacements work fine except one. In the examples below I use English words instead of Marathi.

The first part ("fix") of the following code searches for "{{PAGENAME}}" and replaces it with "{{subst:PAGENAME}}". The msg parameter is the edit summary.

The second fix, "man", finds "man" and replaces it with "gent". The problem is that it also changes "human" to "hugent", "craftsmanship" to "craftsgentship", and so on.

fixes = {
    'name': {
        'regex': True,
        'nocase': True,
        'msg': {'mr': '{{PAGENAME}} → पानाचे मूळ नाव (base name of page)'},
        'replacements': [
            ( r'{{ *PAGENAME *}}', '{{subst:PAGENAME}}' ),
        ],
    },
    'man': {
        'regex': True,
        'msg': {'mr': 'man → gent'},
        'replacements': [
            ('man', 'gent'),
        ],
    },
}

So I tried changing the replacement pair from ('man', 'gent') to ('man ', 'gent ') (with a trailing space), and then to (' man ', ' gent ') (with a space on both sides). With either change nothing was replaced at all, not even the standalone "man".

So how do I change the instance of "He was a good man - a true humanitarian" to "He was a good gent - a true humanitarian" without making it hugentitarian?

Grismar
  • I'm not sure how pywikibot manipulates the string, but maybe try using \s instead of a space. `('\sman\s', '\sgent\s')` – M B Apr 04 '22 at 04:26
  • Works for me with spaces around the replacement parameters. You can also set regex to False because a simple text replacement would do the job. – xqt Apr 13 '22 at 07:07

2 Answers

0

You want occurrences of 'man', but only when it stands by itself: in other words, only when it is not preceded or followed by other letters or symbols that would be part of a word.

I don't know if Marathi contains symbols like '-' that could be part of a word, for example 'He was a real man-child', in which case you may or may not want to replace it.

In English, since you're using regex, you can do this:

'man': {
        'regex': True,
        'msg': {'mr': 'man → gent'},
        'replacements': [
            ('(?<=[^\w]|^)man(?=[^\w]|$)', 'gent'),
        ],
}

The regular expression '(?<=[^\w]|^)man(?=[^\w]|$)' there means:

  • the literal word 'man'
  • preceded by any character that's not a word character [^\w], or the start of the line ^
  • followed by any character that's not a word character [^\w], or the end of the line $

Note that this doesn't cover Man, unless your regex engine is already set to be case-insensitive.
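To make the point about case concrete, here is a minimal sketch in plain Python, outside pywikibot. It assumes only the standard re module and its re.IGNORECASE flag; the helper name keep_case is made up for this illustration. The 'nocase': True setting in the question's first fix looks like the pywikibot-side switch for the same thing, but that is an assumption on my part.

```python
import re

PATTERN = r'(?<=[^\w])man(?=[^\w])'
text = 'A Man, a man.'

# Case-sensitive (the default): only the lowercase 'man' is replaced.
sensitive = re.sub(PATTERN, 'gent', text)
print(sensitive)    # A Man, a gent.

# Case-insensitive: both are matched, but the replacement is always 'gent'.
insensitive = re.sub(PATTERN, 'gent', text, flags=re.IGNORECASE)
print(insensitive)  # A gent, a gent.

def keep_case(match):
    # Hypothetical helper: capitalise the replacement if the match was capitalised.
    return 'Gent' if match.group(0)[0].isupper() else 'gent'

cased = re.sub(PATTERN, keep_case, text, flags=re.IGNORECASE)
print(cased)        # A Gent, a gent.
```

Whether keeping the capital matters depends on the script; Devanagari has no letter case, so for the Marathi replacements this may not come up at all.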

If your regex engine doesn't consider the characters that make up Marathi words to be part of \w, you could replace \w with a character class listing all the characters the language uses, if that's achievable (it would not be for a logographic script like Chinese).
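If \w does turn out to be a problem, the character list can be written as a Unicode range instead of being typed out. The sketch below rests on several assumptions: it treats the Devanagari block (U+0900 to U+097F), minus the danda marks U+0964 and U+0965, as word characters, and the helper name whole_word is invented here. As far as I know, Python's \w does not match Devanagari vowel signs such as U+093E, so a suffix starting with one could slip past a \w check; the made-up suffixed form below exists only to exercise that case.

```python
import re

# Assumption: the Devanagari block U+0900-U+097F counts as "word" characters,
# except the danda (U+0964) and double danda (U+0965) punctuation marks.
DEVANAGARI = r'\u0900-\u0963\u0966-\u097F'

def whole_word(word):
    # Hypothetical helper: match `word` only when no Devanagari
    # character touches it on either side.
    return re.compile(rf'(?<![{DEVANAGARI}]){re.escape(word)}(?![{DEVANAGARI}])')

word = 'माणूस'
# Made-up suffixed form: the word followed by U+093E, U+0932, U+093E.
suffixed = word + '\u093e\u0932\u093e'
text = word + ' ' + suffixed

result = whole_word(word).sub('व्यक्ती', text)
print(result)  # the bare word is replaced, the suffixed form is left alone
```

The range is deliberately broad; if Marathi text on the wiki mixes in other marks that should count as word characters, they would need to be added to the class.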

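Note that regex engines differ on the |^ and |$ alternatives: some accept them, others do not. Python's re module, for example, rejects the look-behind (?<=[^\w]|^) with "look-behind requires fixed-width pattern", because its two alternatives have different widths.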

In pure Python, this works:

import re

# Raw strings (r'...') pass the backslash in \w through to the regex engine unchanged.
text = 'He was a good man, a true humanitarian.'
print(re.sub(r'(?<=[^\w])man(?=[^\w])', 'gent', text))

text = 'तो एक चांगला माणूस होता माणूसला'
print(re.sub(r'(?<=[^\w])माणूस(?=[^\w])', 'व्यक्ती', text))

Output:

He was a good gent, a true humanitarian.
तो एक चांगला व्यक्ती होता माणूसला

So that (?<=[^\w])man(?=[^\w]) may be all you need. (I hope the Marathi here isn't accidentally rude - I blame Google Translate)
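One gap in the shorter pattern: it needs an actual character on each side, so a 'man' at the very start or very end of the text is skipped. A possible variant, sketched here in plain Python and not tested against pywikibot, uses negative lookarounds. These assert "no word character here" and so also succeed at the edges of the text. Both lookarounds are fixed width, so Python's re accepts them.

```python
import re

# Negative lookarounds: no word character directly before or after 'man'.
pattern = r'(?<!\w)man(?!\w)'

samples = [
    'He was a good man - a true humanitarian',
    'man at the start, and at the end: man',
    'human, woman, craftsmanship',
]

results = [re.sub(pattern, 'gent', sample) for sample in samples]
for line in results:
    print(line)
# He was a good gent - a true humanitarian
# gent at the start, and at the end: gent
# human, woman, craftsmanship
```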

Grismar
  • Thanks a lot. I have not tried it, but I think that would work. I found out why my code was not working. Before running the updated code on actual articles, I was experimenting with it in my sandbox. The 'man ' syntax was working properly, but in my sandbox one occurrence was inside a link to the article on man, and another was bare but at the end of the line, with no period and no whitespace after it. These cases can be handled in the articles themselves, with the 'man ' syntax. I have noted down the regex. Thanks a lot again. – usernamekiran Apr 04 '22 at 09:36
-1

Why don't you try this: change "man" into "gent", then run a second pass that replaces every "hugent" with "human". A simple fix.

  • Hi. Thanks for the solution, but that would not be feasible at all. The number of possible changes would be tremendous: human, woman, humanitarian, sportsman, and sportsmanship come to mind straight away. Changing dozens of such occurrences across all the articles of Wikipedia into gibberish would be a huge no-no, and finding them and changing them back would be a huge hassle. It would also be an abuse of server resources. And once everything was done, it would still count as disruptive editing from a non-technical point of view. The end result would be my bot access/authorisation being revoked. – usernamekiran Apr 04 '22 at 09:45