0

I have two files: the 1st contains the data, and the 2nd contains a list of the lines to keep.

I tried to do the filtering with this Python code:

import os.path

# loading the input files
output    = open('descmat.txt', 'w+')
input     = open('descmat_all.txt', 'r')
lists      = open('training_lines.txt', 'r')
print "Test1"

# reading the input files
list_lines = lists.readlines()
list_input = input.readlines()

print "Test2"
output.write(list_input[0])

for i  in range(len(list_lines)):
    for ii in range(len(list_input)):
        position = list_input[ii].find(list_lines[i][:-1])
        if position > -1:
            output.write(list_input[ii])
        break 

print "Test3"
output.close()

But this script cannot find any matches. What is the easiest way to keep only the lines from the 1st file that match the lines in the 2nd file?

gboffi
XuMuK

3 Answers

2

For this kind of problem, Python has the set data type:

# prepare a set of normalised training lines
# stripping new lines avoids possible problems with the last line

OK_lines = set(line.rstrip('\n') for line in open('training_lines.txt'))

# when you leave a with block, all the resources are released
# i.e., no need for file.close()

with open('descmat_all.txt') as infile:
    with open('descmat.txt', 'w') as outfile:
        for line in infile:
            # OK_lines have been stripped, input lines must be stripped as well
            if line.rstrip('\n') in OK_lines:
                outfile.write(line)

A simple test

boffi@debian:~/Documents/tmp$ cat check.py 
# prepare a set of normalised training lines
# stripping new lines avoids possible problems with the last line

OK_lines = set(line.rstrip('\n') for line in open('training_lines.txt'))

# when you leave a with block, all the resources are released
# i.e., no need for file.close()

with open('descmat_all.txt') as infile:
    with open('descmat.txt', 'w') as outfile:
        for line in infile:
            # OK_lines have been stripped, input lines must be stripped as well
            if line.rstrip('\n') in OK_lines:
                outfile.write(line)

boffi@debian:~/Documents/tmp$ cat training_lines.txt 
ada
bob
boffi@debian:~/Documents/tmp$ cat descmat_all.txt 
bob
doug
ada
doug
eddy
ada
bob
boffi@debian:~/Documents/tmp$ python check.py
boffi@debian:~/Documents/tmp$ cat descmat.txt 
bob
ada
ada
bob
boffi@debian:~/Documents/tmp$ 
gboffi
  • The output is just an empty file – XuMuK Mar 21 '16 at 14:35
  • It works for me, see the test case I've added to my post. – gboffi Mar 21 '16 at 14:45
  • Thank you for the explanation, but here are my input files: http://pastebin.com/PTEm2Trp and http://pastebin.com/3MUAQzaQ - running this code results in an empty file... – XuMuK Mar 21 '16 at 14:49
  • My code copies a line from A to C if A's line is exactly a line contained in B. Perhaps I've misunderstood what you want, or perhaps no lines in your input file exactly match any of the lines in your training file. If it's a misunderstanding on my part and you can make your question clearer, I could try to give another answer. – gboffi Mar 21 '16 at 15:20
1

If you read both files into lists, you can simply compare the lists. Here, out should contain the strings that could be matched.

out = [e for e in list_input for i in list_lines if e.startswith(i)]
output.writelines(out)
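
One plausible reason the one-liner above can come back empty: `readlines()` keeps the trailing newline on every entry, so `e.startswith(i)` only succeeds when the data line continues with a newline right after the key. A minimal sketch that strips the keys first; `keep_matching` and the sample data are hypothetical stand-ins for the two files, not part of the original answer.

```python
def keep_matching(data_lines, key_lines):
    """Return the data lines that start with any of the keys."""
    # readlines() keeps the trailing newline, so strip it from each key;
    # skip blank keys, since every string starts with ''.
    keys = tuple(k.rstrip('\r\n') for k in key_lines if k.strip())
    # str.startswith accepts a tuple of prefixes
    return [line for line in data_lines if line.startswith(keys)]


# hypothetical sample data standing in for the two input files
list_input = ["abc 1 2 3\n", "xyz 4 5 6\n", "abd 7 8 9\n"]
list_lines = ["abc\n", "abd\n"]

out = keep_matching(list_input, list_lines)
print(out)
```

Unlike the nested comprehension, this version emits each data line at most once, even if several keys match it.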
RaJa
  • Okay, I did not check your files before. As I understand your files, you have to match the start sequence of each string. I have changed my code a bit. Check if it works. – RaJa Mar 21 '16 at 17:12
  • Thank you! I have found a solution that works; I still don't understand why this code is not working... – XuMuK Mar 21 '16 at 17:19
  • Sorry, my edited code came only just seconds ago. However, I checked my new code with the first two lines of your files and it works. – RaJa Mar 21 '16 at 17:21
  • Do you have Python 2.8? Still not working for me... Very strange, so I will stay at simple "if" solution. – XuMuK Mar 22 '16 at 12:09
  • Interesting, I am working with Python 3.5 but the functionality is basic Python. Is the `out` list already empty or the output-file only? In fact this code does exactly the same as your if-solution. Nested loops in a list comparison. – RaJa Mar 22 '16 at 13:49
0

Replacing this part of the code:

for i  in range(len(list_lines)):
    for ii in range(len(list_input)):
        position = list_input[ii].find(list_lines[i][:-1])
        if position > -1:
            output.write(list_input[ii])
        break 

with this:

for i  in range(len(list_lines)):
    for ii in range(len(list_input)):
        if list_input[ii][:26] == list_lines[i][:-1]:
            output.write(list_input[ii])

does exactly what I need.
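
The nested loops compare every training line against every data line. The same filter can be sketched with a set lookup, assuming (as the `[:26]` slice suggests) that the key always occupies the first 26 characters of a data line; `filter_by_prefix` and `KEY_WIDTH` are hypothetical names used only for illustration.

```python
KEY_WIDTH = 26  # assumption: taken from the [:26] slice in the loop above


def filter_by_prefix(data_lines, key_lines, width=KEY_WIDTH):
    """Keep data lines whose first `width` characters are a known key."""
    # strip line endings so the keys compare cleanly against the slices
    keys = set(k.rstrip('\r\n') for k in key_lines)
    return [line for line in data_lines if line[:width] in keys]


# hypothetical sample data with 3-character keys instead of 26
data = ["abc 1\n", "xyz 2\n", "abd 3\n"]
training = ["abd\n", "abc\n"]

kept = filter_by_prefix(data, training, width=3)
print(kept)
```

Note that the kept lines come out in the order of the data file, whereas the nested loops write them in the order of the training file.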

XuMuK