
I have a large list of replacements, like the one below.

The replacement file, list.txt:

人の,NN
人の名前,FF

And the data in which to replace, text.txt:

aaa人の abc 人の名前def ghi

I want to replace this text as shown below, using list.txt.

>>> my_func('aaa人の abc 人の名前def ghi')
'aaaNN abc FFdef ghi'

This is my code, but I think it is quite inefficient for processing large data.

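import re

# Load the replacement table: each line of list.txt is "source,target"
d = {}
with open('list.txt', 'r', encoding='utf8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, value = line.split(',', 1)
        d[key] = value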

with open('text.txt', 'r', encoding='utf8') as f:
    txt = f.read()

st = 0
lst = []

# [\u4e00-\u9fea\u3040-\u309f] matches kanji (CJK ideographs) and hiragana; katakana is not covered
for match in re.finditer(r"([\u4e00-\u9fea\u3040-\u309f]+)", txt):
    st_m, ed_m = match.span()
    lst.append(txt[st:st_m])

    search = txt[st_m:ed_m]
    rpld = d.get(search, search)  # keep the run unchanged if it is not in the table
    lst.append(rpld)

    st = ed_m

lst.append(txt[st:])

print(''.join(lst))

Please let me know a better way.

arnaud
Bryan
  • Possible duplicate of this, though: https://stackoverflow.com/questions/6116978/how-to-replace-multiple-substrings-of-a-string – arnaud Apr 10 '18 at 13:20
  • @Arnaud No, that can't be the solution. With rep = {'a':'1', 'aa':'2'}, pattern.sub(lambda m: rep[re.escape(m.group(0))], "a and aa") gives the output string '1 and 11'. But in my case it should be '1 and 2'. – Bryan Apr 11 '18 at 05:32
  • What if you just order your replacements alphabetically but by decreasing length, so that longer patterns are replaced first, and thus avoid conflicts? Also, the thread I pointed to has **a lot** of highly voted answers, some of which discuss your issue. – arnaud Apr 11 '18 at 09:21
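
The longest-first ordering suggested in the last comment can be sketched as a single compiled alternation. This is only an illustration under assumptions: `build_replacer` is a hypothetical helper name, and the table is inlined here instead of being read from list.txt.

```python
import re

def build_replacer(table):
    """Return a function that applies every replacement in one pass.

    `build_replacer` is a hypothetical helper, not part of any library.
    """
    # Longest keys first, so '人の名前' is tried before its prefix '人の'.
    keys = sorted(table, key=len, reverse=True)
    # re.escape makes each key match literally.
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # m.group(0) is the matched key itself, so it indexes the table directly.
    return lambda text: pattern.sub(lambda m: table[m.group(0)], text)

# Table inlined for illustration; in practice it would come from list.txt.
my_func = build_replacer({'人の': 'NN', '人の名前': 'FF'})
print(my_func('aaa人の abc 人の名前def ghi'))  # aaaNN abc FFdef ghi
```

The text is scanned once instead of once per table entry. Whether a very large alternation is fast enough is worth measuring on real data.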

1 Answer


After seeing your input aaa人の abc 人の名前def ghi, I see you have whitespace in between. So it's not really a word replacement; it's more of a phrase replacement.

You can refer to the edit history to see the old answer in case you want word replacement.

In the case of phrase replacement, you can use re (regex) and provide a dictionary of replacements. Below is an implementation:

>>> import re
>>> _regex = {r'aaa人の abc 人の名前def ghi': r'人の,NN 人の名前,FF'}
>>> input_string = 'hi aaa人の abc 人の名前def ghi work'
>>> for pattern in _regex.keys():
...     input_string = re.sub(pattern, _regex[pattern], input_string)
...
>>> input_string
'hi 人の,NN 人の名前,FF work'

Below is an object-oriented implementation of the above:

import csv
import re


class RegexCleanser(object):
    _regex = None

    def __init__(self, input_string: str):
        self._input_string = input_string
        self._regex = self._fetch_rows_as_dict_keys(r'C:\Users\adity\Desktop\japsyn.csv')

    @staticmethod
    def _fetch_rows_as_dict_keys(file_path: str) -> dict:
        """
        Reads the data from the file
        :param file_path: the path of the file that holds the lookup data
        :return: the read data
        """
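        try:
            word_map = {}
            # The context manager closes the file even if parsing fails
            with open(file_path, encoding='UTF-8', newline='') as f:
                for line in csv.reader(f):
                    word, syn = line
                    word_map[word] = syn
            return word_map
        except FileNotFoundError:
            print(f'Could not find the file at {file_path}')
            # Return an empty mapping so clean() can still iterate
            return {}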

    def clean(self) -> str:
        for pattern in self._regex.keys():
            self._input_string = re.sub(pattern, self._regex[pattern], self._input_string)
        return self._input_string

Usage:

if __name__ == '__main__':
    cleaner = RegexCleanser(r'hi aaa人の abc 人の名前def ghi I dont know this language.')
    clean_string = cleaner.clean()
    print(clean_string)
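
One caveat with applying the patterns one at a time, as the loop in `clean` does: each `re.sub` call works on the output of the previous one, so when one key is a prefix of another, the result depends on iteration order. A minimal sketch, assuming the two entries from the question's list.txt as the table (`apply_in_order` is a hypothetical helper):

```python
import re

def apply_in_order(text, pairs):
    """Apply (key, replacement) pairs one after another, in the given order."""
    for key, repl in pairs:
        # re.escape treats the key as literal text, not as a regex.
        text = re.sub(re.escape(key), repl, text)
    return text

text = 'aaa人の abc 人の名前def ghi'

# Shorter key first: '人の' is consumed before '人の名前' can match.
short_first = apply_in_order(text, [('人の', 'NN'), ('人の名前', 'FF')])
print(short_first)  # aaaNN abc NN名前def ghi

# Longer key first gives the output the question asks for.
long_first = apply_in_order(text, [('人の名前', 'FF'), ('人の', 'NN')])
print(long_first)  # aaaNN abc FFdef ghi
```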
iam.Carrot
  • When I use 'aaa人の abc 人の名前def ghi' as input_string and 人の,NN 人の名前,FF as 'synonyms.csv', nothing is replaced. – Bryan Apr 10 '18 at 11:28
  • @Bryan What language is this? – iam.Carrot Apr 10 '18 at 11:29
  • Some Japanese characters are included in my sample. – Bryan Apr 10 '18 at 11:39
  • @Bryan Great, so what you can do is put them in a `csv` too. Now use a `string encoding` to actually define it, since `strings` are just sequences of `bytes`. You can use `UTF-8` or even `UTF-32`, or you can also use `Unicode` as the encoding format for the `csv`. Can you please share an input string and what it needs to be replaced with? I can wrap a demo around it. – iam.Carrot Apr 10 '18 at 15:56
  • Thank you for your help. But your code is not what I want. Please read my edited post again. – Bryan Apr 11 '18 at 05:00