
I am trying to put together a Python script to compare two text files of tab-separated values. The smaller file has one field per record, holding a key value (much like a database primary key). The larger file has the key in its first field, followed by up to thousands of fields per record, and contains tens of thousands of records.

I am trying to select (from the larger file) only the records which match their corresponding key in the smaller file, and output these to a new text file. The keys occur in the first field of each record.

I have hit a wall. Admittedly, I have been trying for loops, and thus far have had minimal success. I got it to display the key values of each file--a small victory!

I may be a glutton for punishment, as I am bent on using Python (2.7) to solve this rather than importing the data into something SQL-based; I will never learn otherwise!

UPDATE: I have the following code thus far. Is the use of forward-slash correct for the write statement?

# Defining some counters, and setting them to zero.
counter_one = 0
counter_two = 0
counter_three = 0
counter_four = 0
# Defining a couple arrays for sorting purposes.
array_one = []
array_two = []

# This module opens the list of records to be selected.
with open(r"c:\lines_to_parse.txt") as f0:  # raw string keeps the backslash literal
    LTPlines = f0.readlines()
    for i, line in enumerate(LTPlines):
         returned_line = line.split()
         array_one.append(returned_line)
    for line in array_one:
         counter_one = counter_one + 1

# This module opens the file to be trimmed as an array.
with open(r'c:\target_data.txt') as f1:  # raw string: '\t' would otherwise be read as a tab
    targetlines = f1.readlines()
    for i, line in enumerate(targetlines):
         array_two.append(line.split())
    for line in array_two:
        counter_two = counter_two + 1    

# The last module performs a logical check
#  of the data and writes to a tertiary file.
with open("c:/research/results", 'w') as f2:
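    # Walk both sorted key lists in step; stop when either list is exhausted,
    # so that neither index can run past the end of its array.
    while counter_three < counter_one and counter_four < counter_two:
        if array_one[counter_three][0] == array_two[counter_four][0]:
            # Write the record as a tab-separated line, not as a list repr.
            f2.write("\t".join(array_two[counter_four]) + "\n")
            counter_three = counter_three + 1
            counter_four = counter_four + 1
        else:
            counter_four = counter_four + 1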
Brian Tompsett - 汤莱恩
Joe Nolan

1 Answer


You could build a dictionary from the keys in the small file: each key from the small file becomes a dictionary key, and the value can simply be True (the value is not important). Keep this dict in memory.

Then open the output file for writing and the larger file for reading. For each line in the larger file, check whether its key exists in the dictionary, and if it does, write the line to the output file.

I am not sure whether this is clear enough, or whether that was your problem.
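A minimal sketch of this idea follows. It uses a set in place of the dictionary with True values, since only membership matters. The function names `load_keys` and `filter_records` are made up for illustration, and the sketch assumes the files are tab-separated with the key in the first field, as described in the question.

```python
def load_keys(keys_path):
    """Read the small file and return its first-field keys as a set."""
    keys = set()
    with open(keys_path) as key_file:
        for line in key_file:
            line = line.rstrip("\r\n")
            if line:  # skip blank lines
                keys.add(line.split("\t")[0])
    return keys


def filter_records(keys_path, data_path, out_path):
    """Copy records from data_path whose first field is a known key.

    Returns the number of records written to out_path.
    """
    keys = load_keys(keys_path)
    matched = 0
    with open(data_path) as data_file, open(out_path, "w") as out_file:
        for line in data_file:
            # Only the first field is needed, so split once at the first tab.
            key = line.split("\t", 1)[0].rstrip("\r\n")
            if key in keys:
                out_file.write(line)  # the line keeps its original newline
                matched += 1
    return matched
```

With the paths from the question, a call might look like `filter_records("c:/lines_to_parse.txt", "c:/target_data.txt", "c:/research/results")`. Because the larger file is read one line at a time, only the keys need to fit in memory, and the records do not have to be sorted.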

Raul Guiu
  • I'm a Python novice, I apologize in advance. Thus far, I have two functions to enumerate through the files, and save them as lists. I had planned on iterating through the key fields, using a series of loops, as the keys are in sorted order (least to greatest), and writing a record to the new file if it found a match. Could you give me a generic code example of your idea? – Joe Nolan Mar 17 '14 at 21:43
  • Choose one of the ways of looping over the file lines from here: http://stackoverflow.com/questions/3277503/python-read-file-line-by-line-into-array – Raul Guiu Mar 17 '14 at 21:51
  • Please see the above edit, as I've made some progress. This is probably the LEAST efficient algorithm, but as long as it works, I'm happy. – Joe Nolan Mar 18 '14 at 17:21