1

I have a big text file, around 900 MB. I want to read the file line by line and, for each line, do a find-and-replace based on the items in a list of phrases. Let's take a hypothetical situation.

Let's say that I have a single .txt file containing all of Wikipedia in plain text.

I have a Python list of phrases, call it P: P = ['hello world', 'twenty three', 'any bigram', 'any trigram']. All items in P are phrases (no single words).

Given this list P, I am trying to scan the .txt file line by line and, using P, check whether any of P's items occur in the current line; wherever they do, the space between the words should be replaced with _. For example, if the current line says "hello world twenty three any text goes here", it should become "hello_world twenty_three any text goes here". The length of P is 14,000.

I have implemented this in Python, and it is very slow: it only manages an average rate of about 5,000 lines/minute, and the .txt file is huge, with millions of lines. Is there an efficient way of doing this? Thanks

Update:

with open("/media/saurabh/New Volume/wikiextractor/output/Final_Txt/single_cs.txt") as infile:
    for index,line in enumerate(infile):
        for concept_phrase in concepts:
            line = line.replace(concept_phrase, concept_phrase.replace(' ', '_'))
        with open('/media/saurabh/New Volume/wikiextractor/output/Final_Txt/single_cs_final.txt', 'a') as file:
            file.write(line +  '\n' )  
        print (index)
saurabh vyas
  • Until we see your code it's difficult to suggest how it could be sped up. 5,000 lines a minute certainly sounds very slow. – holdenweb Aug 26 '17 at 17:22
  • I agree; I have updated the question with a GitHub gist. Thanks – saurabh vyas Aug 26 '17 at 17:29
  • Your problem is the `for` inside the `for`; take a look here at how to do it properly: https://stackoverflow.com/questions/16622754/how-do-you-replace-a-line-of-text-in-a-text-file-python – sKwa Aug 26 '17 at 17:40
  • @sKwa I have tried the code provided in the accepted answer at that link. It still gives about the same average speed of 5,000 lines per minute; at this speed I can't do this for this 900 MB file. – saurabh vyas Aug 26 '17 at 17:47
  • what about `sed`? – sKwa Aug 26 '17 at 17:51
  • sed is a great tool for command-line applications, but I am working in Python and have a Python list of 14,000+ word phrases, each of which needs to be searched and replaced; I am not sure how sed is going to work in this scenario (one way is sketched after these comments). – saurabh vyas Aug 26 '17 at 18:00
  • I wrote a Perl script and am getting a speed of 70k lines per second! `Rate = 70866 lines / sec, lines = 63000000, time elapsed = 889 seconds Rate = 70800 lines / sec, lines = 63083307, time elapsed = 891 seconds real 890.91 user 879.12 sys 7.08 $ wc -l input.txt output.txt phrases.txt 63083307 input.txt 63083307 output.txt 4 phrases.txt` – Sameer Naik Aug 27 '17 at 20:49
  • @SameerNaik Can you tell me the hardware specs of your system, and how many phrases you are testing against each line? I have 14,000+ phrases. – saurabh vyas Aug 29 '17 at 18:40
  • I just used 4 phrases. Specs are MacOS Sierra 10.12.6. MacBook Pro 15 inch Mid 2015 2.2GHz Core i7 16GB RAM. Send me all your phrases if you can. – Sameer Naik Sep 03 '17 at 05:53
  • @SameerNaik I really appreciate your help, but this was just an initial investigation of the problem. I discussed it with my research adviser, and we are now looking at an alternate approach with far lower time complexity. Thanks again! – saurabh vyas Sep 03 '17 at 10:41
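
To make the sed idea from these comments concrete: `sed -f` reads a script file of `s/old/new/g` commands, so one could generate that script from the Python phrase list and let sed do the streaming edit. A rough sketch, with `phrases.sed` and the four-item phrase list as illustrative stand-ins (how well sed copes with a 14,000-command script is untested here):

# Write one s/old/new/g command per phrase, then stream the big file
# through sed on the command line, e.g.:
#   sed -f phrases.sed single_cs.txt > single_cs_final.txt
concepts = ['hello world', 'twenty three', 'any bigram', 'any trigram']

with open('phrases.sed', 'w') as script:
    for cp in concepts:
        # These phrases are plain words and spaces; any phrase containing
        # regex metacharacters or '/' would need fuller escaping.
        script.write('s/{}/{}/g\n'.format(cp, cp.replace(' ', '_')))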

2 Answers

2

You should not open and close the output file for every line. Moreover, you can store the replacement for each concept_phrase up front and avoid computing the translated version of the concept_phrases k * n times (k is the number of concept phrases, n is the number of lines):

in_file = "/media/saurabh/New Volume/wikiextractor/output/Final_Txt/single_cs.txt"
out_file = "/media/saurabh/New Volume/wikiextractor/output/Final_Txt/single_cs_final.txt"

# Precompute each phrase's underscored form once, up front.
replacement = {cp: cp.replace(' ', '_') for cp in concepts}

# Open both files once, outside the loop, instead of reopening per line.
with open(in_file) as infile, open(out_file, 'a') as outfile:
    for line in infile:
        for concept_phrase in concepts:
            line = line.replace(concept_phrase, replacement[concept_phrase])
        outfile.write(line)

`str.replace` is generally fast, and I doubt a one-shot replacement with `re.sub` is going to beat it, even though the calls to `str.replace` are repeated.
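
For comparison, the one-shot `re.sub` version mentioned above might look like the sketch below. It is only a sketch: Python's re module compiles a 14,000-way alternation into a backtracking pattern rather than a DFA, so, as this answer suggests, it may well not beat the `str.replace` loop:

import re

# Hypothetical stand-in for the 14,000-item concepts list.
concepts = ['hello world', 'twenty three', 'any bigram', 'any trigram']

# Longest-first alternation so overlapping phrases prefer the longer match;
# re.escape guards any phrase that contains regex metacharacters.
pattern = re.compile('|'.join(re.escape(cp)
                              for cp in sorted(concepts, key=len, reverse=True)))

def underscore(match):
    # Turn the spaces inside the matched phrase into underscores.
    return match.group(0).replace(' ', '_')

print(pattern.sub(underscore, "hello world twenty three any text goes here"))
# hello_world twenty_three any text goes here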

Moses Koledoye
  • The average line rate is now about 7,000 per minute. I guess it's not going to jump by a huge factor even with optimization because of the many phrases in that list (14,000+), but thank you for providing the code; it is working fine and shows an improvement in speed. – saurabh vyas Aug 26 '17 at 17:59
  • @saurabhvyas Also, make sure to remove that `print`. Printing also accounts for some of the time. – Moses Koledoye Aug 26 '17 at 18:00
  • I see; the printing was to give me some assurance that progress was being made, and to get a rough idea. Now that I know, I'll remove it. Thanks – saurabh vyas Aug 26 '17 at 18:04
  • You can try using tqdm, which will show you a progress bar for the iterations. – Max Aug 26 '17 at 18:16
  • You are probably going to have to think quite hard about how you represent the phrases you want to match. There are many advanced pattern-matching techniques that you could deploy if this is not a one-off task; see the sketch after these comments. – holdenweb Aug 26 '17 at 18:58
  • Did you try writing this in Java and comparing the speed? Or maybe Perl. – Sameer Naik Aug 26 '17 at 21:25
  • @SameerNaik Not yet, but I am sure the speedup won't be of the order I am hoping for, so I am looking for an alternate programming strategy/algorithm. – saurabh vyas Aug 27 '17 at 07:31
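
To make the comment about advanced pattern-matching techniques concrete: an Aho-Corasick-style matcher scans each line once, at a cost that does not grow with the number of phrases. The sketch below uses the third-party flashtext library (`pip install flashtext`), which implements this kind of bulk keyword replacement; it is one possible approach, not something the asker confirmed using:

from flashtext import KeywordProcessor

# Hypothetical stand-in for the 14,000-item concepts list.
concepts = ['hello world', 'twenty three', 'any bigram', 'any trigram']

# case_sensitive=True mirrors str.replace's exact matching.
kp = KeywordProcessor(case_sensitive=True)
for cp in concepts:
    kp.add_keyword(cp, cp.replace(' ', '_'))

# One pass per line, however large the phrase list; note that flashtext
# matches on word boundaries, unlike plain str.replace.
print(kp.replace_keywords("hello world twenty three any text goes here"))
# hello_world twenty_three any text goes here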
1

I would suggest compiling the file using the Cython module and running it. That will speed up your code.
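
For anyone who wants to try this route, a minimal build setup might look like the following sketch (replace_phrases.pyx is a hypothetical file holding the replacement loop from the other answer; plain Python is valid Cython, so it compiles as-is):

# setup.py -- build the extension in place with:
#   python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("replace_phrases.pyx"))

After building, import replace_phrases picks up the compiled extension. As the last comment below notes, though, compilation does not reduce the k * n replacements themselves.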

Max
  • I will definitely experiment with that. Thanks – saurabh vyas Aug 26 '17 at 17:30
  • did cython reduce the time taken? – Max Aug 26 '17 at 18:15
  • Unfortunately, I have never used Cython and don't know how to use it. I tried pip installing it, but it turns out it is already installed. I will investigate more, but looking at the documentation, I am confident it will definitely reduce the time taken. – saurabh vyas Aug 26 '17 at 18:26
  • Note that this advice, while potentially useful, should not be followed until you have done as much as you reasonably can to improve the algorithm you use. It's far more likely you can improve speed by more advanced use of Python than by simply speeding up a poorly designed program. – holdenweb Aug 26 '17 at 19:01