file2 has a big list of numbers and file1 has a small list of numbers. file2 duplicates some of the numbers in file1. I want to remove from file1 the numbers that also appear in file2, without deleting any data from file2 and without removing the corresponding lines from file1. (The line numbers I refer to are the ones the PyCharm IDE displays.) The code below does remove the duplicate data from file1 and leaves file2 untouched, which is what I want. However, it deletes the lines themselves along with the duplicate numbers and rewrites file1, which is what I don't want.

import fileinput

# small file2
with open('file2.txt') as fin:
    exclude = set(line.rstrip() for line in fin)
# big file1
    for line in fileinput.input('file1.txt', inplace=True):
        if line.rstrip() not in exclude:
            print(line)

Example of what is happening, with file2 containing 34344:

file-1 at start:
54545
34344
23232
78787

file-1 end:
54545
23232
78787

What I want:

file-1 start:
54545
34344
23232
78787

file-1 end:
54545

23232
78787

cyberzyme
  • Does file2 have line numbers (is it a numbered list) too? Do you care about them? I mean, let's say that file2 has some data which is also present in file1 but on a different line. Should it be removed? – Valentino Feb 15 '19 at 02:21
  • The data in file2 has line numbers, but they are insignificant. For example: on day one, file1 has 1000 lines of data and file2 has no lines and no data. On day two, file1 still has 1000 lines but only 950 contain data, while file2 has 50 lines of data. On day three, file2 gets more data, which is compared to file1 and removed from it. The process continues: file2 grows while file1 shrinks, until file1 has no data, just 1000 line numbers, and file2 has 1000 lines of data. – cyberzyme Feb 18 '19 at 15:02
  • @Valentino Thank you for pointing out my error in the print statement, and I am sorry about my mistake in the example with line numbers and dots that are not there. The line numbers are just assigned by PyCharm. After examining your code I think I see where my problem might be. My code does remove duplicates and does exactly what I want as far as file2 is concerned. The problem lies with file1, in the for loop: it needs some code to either leave blank lines where the removed data was or fill those lines with an empty string. But I am not sure how to do that or whether that is the best way. – cyberzyme Feb 19 '19 at 14:58
  • Update: I just realized that my code in the for loop is reprinting file1 after removing the duplicate data, and that is something I do not want to do. I just want to remove the duplicates and leave the lines blank, as reprinting a file that will eventually contain millions and maybe billions of entries is inefficient. – cyberzyme Feb 19 '19 at 15:17
  • I'm afraid the comments are becoming hard to follow. Could you please update the question and fix the misleading parts (like the line numbers, which are not in the text)? – Valentino Feb 19 '19 at 15:26
  • You may also read [this post](https://stackoverflow.com/questions/5453267/is-it-possible-to-modify-lines-in-a-file-in-place). In fact, after your update, I think your question is probably a duplicate of that. – Valentino Feb 19 '19 at 16:13
  • @Valentino Thank you. I rephrased the question and redid the examples. I also looked at that other example and it did not do what I wanted. – cyberzyme Feb 19 '19 at 17:45
  • Just one other question: are your data integers only? – Valentino Feb 19 '19 at 17:52
  • They are number strings in a text file, but yes, numbers only. – cyberzyme Feb 19 '19 at 18:18

1 Answer

You just need to print an empty line whenever you find a value that is in the exclude set.

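import fileinput

# small file2: the values to blank out in file1
with open('file2.txt') as fin:
    exclude = set(line.rstrip() for line in fin)

# big file1: with inplace=True, print() writes into the file
for line in fileinput.input('file1.txt', inplace=True):
    if line.rstrip() not in exclude:
        print(line, end='')  # line already ends with a newline
    else:
        print('')  # keep the line, but leave it empty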

If file1.txt is:

54545
1313
23232
13551

And file2.txt is:

1313
13551

After running the script above, file1.txt becomes:

54545

23232

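The example above can be reproduced in a throwaway directory. This is only a sketch of the same approach wrapped in a function; the name `blank_out_matches` and the temporary file paths are illustrative assumptions, not part of the original script.

```python
import fileinput
import os
import tempfile


def blank_out_matches(target_path, exclude_path):
    """Replace every line of target_path that appears in exclude_path with an empty line."""
    with open(exclude_path) as fin:
        exclude = {line.rstrip() for line in fin}
    with fileinput.input(target_path, inplace=True) as lines:
        for line in lines:
            value = line.rstrip()
            # While inplace=True is active, print() writes into the file.
            print("" if value in exclude else value)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        file1 = os.path.join(tmp, "file1.txt")
        file2 = os.path.join(tmp, "file2.txt")
        with open(file1, "w") as f:
            f.write("54545\n1313\n23232\n13551\n")
        with open(file2, "w") as f:
            f.write("1313\n13551\n")
        blank_out_matches(file1, file2)
        with open(file1) as f:
            print(repr(f.read()))
```

Printing the `repr` makes the empty lines visible as consecutive `\n` characters.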
Small note on efficiency

As you said, this code does rewrite every line, edited or not. Deleting or rewriting only a few lines in the middle of a file is not easy, and in any case I am not sure it would be more efficient here: you do not know a priori which lines need editing, so you always have to read and process the full file line by line to find them. As far as I know, you will hardly find a solution that is really more efficient than this one. I would be glad to be proven wrong if anybody knows how.
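For illustration only, here is a sketch of one variant that avoids rewriting the untouched lines: it overwrites each matching value with the same number of spaces, so the file size and all line positions stay the same. It assumes that a whitespace-only line is an acceptable "blank" line, which the question does not confirm, and the function name `blank_in_place` is hypothetical. The whole file is still read once, so whether it is actually faster would have to be measured.

```python
def blank_in_place(target_path, exclude_path):
    """Overwrite lines of target_path found in exclude_path with spaces.

    Returns the number of lines blanked. Assumption: whitespace-only
    lines count as blank for whatever reads the file later.
    """
    with open(exclude_path, "rb") as fin:
        exclude = {line.strip() for line in fin}
    exclude.discard(b"")  # never treat empty lines as values to match

    blanked = 0
    with open(target_path, "r+b") as f:
        while True:
            start = f.tell()          # byte offset where this line begins
            line = f.readline()
            if not line:
                break                 # end of file
            if line.strip() in exclude:
                resume = f.tell()     # offset of the next line
                body_len = len(line.rstrip(b"\r\n"))
                f.seek(start)
                f.write(b" " * body_len)  # same length, newline untouched
                f.seek(resume)
                blanked += 1
    return blanked
```

With the example files above, the two matching lines would become four and five spaces respectively, and the other lines are never written.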

Valentino
  • Sorry, no, there is no dot, and the line number is assigned by my IDE, PyCharm. The data is just a single column of numbers. I thought about filling the blank space with zeros, but I am not sure if that will work. – cyberzyme Feb 19 '19 at 14:44
  • @cyberzyme I've edited my answer according to the modified question. – Valentino Feb 19 '19 at 19:39
  • Thank you, it works perfectly, just how I want it to. Yes, it is a linear search in that it runs over every line. I think that if I add code to do a log n search in file1 for each element of file2, that will reduce the run time considerably. – cyberzyme Feb 19 '19 at 20:26