I have a dictionary, dict, with some words (2000 of them), and a huge text file, such as a Wikipedia corpus, in plain-text format. For each word that appears both in the dictionary and in the text file, I would like to replace it with word_1, that is, the same word with _1 appended.

with open("wiki.txt",'r') as original, open("new.txt",'w') as mod:
    for line in original:
        new_line = line
        for word in line.split():
            if dict.get(word.lower()) is not None:
                new_line = new_line.replace(word, word + "_1")
        mod.write(new_line)

This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.

This works for short files, but with the much longer file I am using as input, it "freezes" my computer.

Is there a more efficient way to do this?

Edit for Adi219:

Your code seems to work, but there is a problem: if a line is Albert is a friend of Albert and my dictionary contains Albert, then after the for loop the line becomes Albert_1_1 is a friend of Albert_1_1, because each occurrence of the word in the loop triggers another replace over the whole line. How can I replace only the exact word I want, to avoid repetitions like _1_1_1_1?

Edit 2: To solve the previous problem, I changed your code to:

with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                mod.write(word+"_1 ")
            else:
                mod.write(word+" ")
        mod.write("\n")

Now everything should work.
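
For anyone who also needs to keep the original spacing and punctuation, here is a minimal sketch of a single-pass alternative using re.sub with a callback. The names vocab, tag_line and tag_file are hypothetical, and the tiny vocab set only stands in for the real 2000-word dictionary. It assumes that "word" means a run of \w characters, which differs from the split()-based code above: a token such as Albert. is matched as Albert followed by a period.

```python
import re

# Hypothetical stand-in for the 2000-word dictionary; a set is enough
# because only membership is tested.
vocab = {"albert", "friend"}

# Assumption: a "word" is a maximal run of letters, digits and underscores.
WORD = re.compile(r"\w+")


def tag_line(line, vocab):
    """Append "_1" to every word of `line` whose lowercase form is in `vocab`."""
    def tag(match):
        word = match.group(0)
        return word + "_1" if word.lower() in vocab else word

    # Each word is visited exactly once, so a suffix is never added twice.
    return WORD.sub(tag, line)


def tag_file(src_path, dst_path, vocab):
    """Stream `src_path` line by line and write the tagged text to `dst_path`."""
    with open(src_path, "r") as original, open(dst_path, "w") as mod:
        for line in original:
            mod.write(tag_line(line, vocab))
```

Calling tag_file("wiki.txt", "new.txt", vocab) would then play the role of the loop above. Because every line is scanned once instead of once per matching word, very long lines should be less of a problem, although I have not measured this on a real corpus.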

SctALE
  • It is hard to say whether a different strategy would be more efficient or not. I think you might want to find out what the problem actually is. Use timeit and some memory profiler. – Falk Schuetzenmeister Mar 07 '18 at 18:30
  • Possible duplicate of [Read large text files in Python, line by line without loading it in to memory](https://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory) – Chamath Mar 07 '18 at 18:31
  • People working with a dataframe can do the same with `dataframe.replace(replace_what, replace_with)`. See [this](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) for documentation. – Rex5 Aug 12 '19 at 11:22

1 Answer

A few things:

  1. You could remove the declaration of new_line. Then change the line new_line = new_line.replace(...) to line = line.replace(...). You would also have to call mod.write(line) afterwards.

  2. You could add words = line.split() and use for word in words: for the for loop. Note that for word in line.split() already calls .split() only once per line, so this mainly helps readability rather than speed.

  3. You could split your large .txt file (manually, perhaps) into multiple smaller files, run one instance of your program on each file, and then combine the outputs into one file. Note: You would have to remember to change the filename for each file you're reading from and writing to.

So, your code would look like:

with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                line = line.replace(word, word + "_1")
        mod.write(line)
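
For point 3, the splitting does not have to be done by hand. Below is a rough sketch, assuming that splitting on line boundaries is acceptable for your corpus. The helper names chunked and split_file, the .partN naming scheme and the default of 100,000 lines per part are all made up for illustration.

```python
from itertools import islice


def chunked(lines, size):
    """Yield successive lists of at most `size` items from the iterable `lines`."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def split_file(path, lines_per_part=100_000):
    """Write `path` out as path.part0, path.part1, ... and return the part names."""
    part_names = []
    with open(path, "r") as src:
        for index, chunk in enumerate(chunked(src, lines_per_part)):
            part_name = f"{path}.part{index}"
            with open(part_name, "w") as part:
                part.writelines(chunk)
            part_names.append(part_name)
    return part_names
```

Each part could then be processed by a separate run of the program, with the outputs concatenated in the same order afterwards.
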
Adi219
  • Your code seems to work, but there is a problem. I updated the question, can you take a look? – SctALE Mar 08 '18 at 09:09