I previously asked "How to find and replace multiple lines in text file?", but that question was ultimately unclear, so I'm asking a new one to be more specific.
I have Python 2.7.
I have three text files: data.txt, find.txt, and replace.txt.
data.txt is a file of about 1 MB with several thousand lines. find.txt contains X lines that I want to find in data.txt and replace with the Y lines in replace.txt. X and Y may or may not be the same number.
For example:
data.txt
pumpkin
apple
banana
cherry
himalaya
skeleton
apple
banana
cherry
watermelon
fruit
find.txt
apple
banana
cherry
replace.txt
1
2
3
4
5
So, in the above example, I want to search for all occurrences of the block of lines apple, banana, cherry in the data and insert 1, 2, 3, 4, 5 in its place.
So, the resulting data.txt would look like:
pumpkin
1
2
3
4
5
himalaya
skeleton
1
2
3
4
5
watermelon
fruit
Or, if replace.txt had fewer lines than find.txt (say, only 1 and 2):
pumpkin
1
2
himalaya
skeleton
1
2
watermelon
fruit
I am having some trouble finding the right approach, since my data.txt is about 1 MB and I want to be as efficient as possible. One dumb way is to concatenate everything into one long string, use replace, and then write the result to a new text file so all the line breaks are restored.
data = open("data.txt", 'r')
find = open("find.txt", 'r')
replace = open("replace.txt", 'r')
data_str = ""
find_str = ""
replace_str = ""
for line in data:  # concatenate it into one long string
    data_str += line
for line in find:  # concatenate it into one long string
    find_str += line
for line in replace:  # concatenate it into one long string
    replace_str += line
new_data = data_str.replace(find_str, replace_str)  # operate on the strings, not the file objects
new_file = open("new_data.txt", "w")
new_file.write(new_data)
new_file.close()  # make sure the output is flushed to disk
But this seems convoluted and inefficient for a large data file like mine.
Here is pseudo-code for something like what I would like to see:
(x, y) = find_lines(data.txt, find.txt)  # returns the line numbers in data.txt that contain find.txt
replace_data_between(x, y, data.txt, replace.txt)  # replaces the data between lines x and y with replace.txt

def find_lines(...):
    location = 0
    LOOP1:
    for find_line in find:
        for i, data_line in enumerate(data).startingAtLine(location):
            if find_line == data_line:
                location = i  # found possibility
                for idx in range(NUMBER_LINES_IN_FIND):
                    if find_line[idx] != data_line[idx+location]:  # compare line by line
                        # if the subsequent lines don't match, then go back and search again
                        goto LOOP1
As you can see, I am having trouble with the logic of all this. Can someone point me in the right direction?
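For reference, the line-by-line search that the pseudo-code is reaching for could be sketched roughly as below. This is only a minimal sketch under a few assumptions: the whole file fits in memory as a list of lines (reasonable at about 1 MB), lines are compared without their trailing newlines, and matches do not overlap. The function name replace_blocks is hypothetical, made up for illustration, and not part of any library.

```python
def replace_blocks(data_lines, find_lines, replace_lines):
    """Return a new list in which every run of lines equal to find_lines
    is replaced by replace_lines (hypothetical helper, not a library call)."""
    if not find_lines:
        return list(data_lines)  # nothing to search for
    n = len(find_lines)
    result = []
    i = 0
    while i < len(data_lines):
        # Compare a window of n lines against the block we are looking for.
        if data_lines[i:i + n] == find_lines:
            result.extend(replace_lines)
            i += n  # skip past the matched block
        else:
            result.append(data_lines[i])
            i += 1
    return result


# The example from the question, with each file already split into lines.
data = ["pumpkin", "apple", "banana", "cherry", "himalaya", "skeleton",
        "apple", "banana", "cherry", "watermelon", "fruit"]
find = ["apple", "banana", "cherry"]
replace = ["1", "2", "3", "4", "5"]
new_data = replace_blocks(data, find, replace)
print(new_data)
```

With real files, the three lists could be obtained with something like open("data.txt").read().splitlines(), and the result written back with "\n".join(new_data). Because whole lines are compared, a line such as pineapple would not be mistaken for apple, which a plain string replace on the concatenated text could do.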