3

I've been trying to solve this issue all day without success.

I have an 'original file', let's call it 'infile', which is the file I want to edit. Additionally, I have another file that functions as a 'dictionary', let's call it 'inlist'.

Here are examples of the infile:

PRMT6   10505   Q96LA8  HMGA1   02829   NP_665906
WDR77   14387   NP_077007   SNRPE   00548   NP_003085
NCOA3   03570   NP_858045   RELA    01241   NP_068810
ITCH    07565   Q96J02  DTX1    03991   NP_004407

And the inlist:

NP_060607   Q96LA8
NP_001244066    Q96J02
NP_077007   Q9BQA1
NP_858045   Q9Y6Q9

My current approach is to split each line into its columns on the existing tabs. The objective is to read each line of the infile and check the following:

  1. If the element in the 3rd column of the infile is found in the 1st column of the inlist, change that element for the respective one in the inlist 2nd column
  2. If the element in the 3rd column of the infile is found in the 2nd column of the inlist, do nothing
  3. Same thing for the 6th column of the infile (index 5 in the code below)

This should retrieve the output:

PRMT6   10505   Q96LA8  HMGA1   02829   Q(...)
WDR77   14387   Q9BQA1  SNRPE   00548   Q(...)
NCOA3   03570   Q9Y6Q9  RELA    01241   Q(...)
ITCH    07565   Q96J02  DTX1    03991   Q(...)

NOTE: not all codes start with Q

I've tried using a while loop, but wasn't successful, and I'm too ashamed to post the code here (I'm new to programming, so I don't want to get demotivated so early in the 'game'). Something that would be perfect to solve this would be:

for line in inlist #, infile: <--- THIS PART! Reading both files, splitting both files, replacing both files...
        inlistcolumns = line.split('\t')
        infilecolumns = line.split('\t')
        if inlistcolumns[0] in infilecolumns[2]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(inlistcolumns[1]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(infilecolumns[5]) + "\n")
        elif inlistcolumns[0] in infilecolumns[5]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(infilecolumns[2]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(inlistcolumns[1]) + "\n")
        else:
            outfile.write('\t'.join(infilecolumns) + '\n')

Help would be much appreciated. Thanks!

Ok, after the hints of Sephallia and Jlengrand I got this:

for line in infile:
    try:
    # Read lines in the dictionary
        line2 = inlist.readline()
        inlistcolumns = line.split('\t')
        infilecolumns = line.split('\t')
        if inlistcolumns[0] in infilecolumns[2]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(inlistcolumns[1]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(infilecolumns[5]))
        elif inlistcolumns[0] in infilecolumns[5]:
                outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(infilecolumns[2]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(inlistcolumns[1]))
        else:
                    outfile.write('\t'.join(infilecolumns))
    except IndexError:
        print "End of dictionary reached. Restarting from top."

The problem is that apparently the if statements are not doing their job, as the output file remained equal to the input file. What can I be doing wrong?

Edit 2:

As asked by some, here goes the full code:

import os

def replace(infilename, linename, outfilename):
    # Open original file and output file
    infile = open(infilename, 'rt')
    inlist = open(linename, 'rt')
    outfile = open(outfilename, 'wt')

    # Read lines and find those to be replaced
    for line in infile:
        infilecolumns = line.split('\t')
        line2 = inlist.readline()
        inlistcolumns = line2.split('\t')
        if inlistcolumns[0] in infilecolumns[2]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(inlistcolumns[1]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(infilecolumns[5]))
        elif inlistcolumns[0] in infilecolumns[5]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(infilecolumns[2]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(inlistcolumns[1]))
        outfile.write('\t'.join(infilecolumns))

    # Close files
    infile.close()
    inlist.close()
    outfile.close()


if __name__ == '__main__':
    wdir = os.getcwd()
    outdir = os.path.join(wdir, 'results.txt')
    outname = os.path.basename(outdir)
    original = raw_input("Type the name of the file to be parsed\n")
    inputlist = raw_input("Type the name of the library to be used\n")
    linesdir = os.path.join(wdir, inputlist)
    linesname = os.path.basename(linesdir)
    indir = os.path.join(wdir, original)
    inname = os.path.basename(indir)

    replace(indir, linesdir, outdir)

    print "Successfully applied changes.\nOriginal: %s\nLibrary: %s\nOutput:%s" % (inname, linesname, outname)

The first file to be used is hprdtotal.txt: https://www.dropbox.com/s/hohvlcdqvziewte/hprdmap.txt And the second is hprdmap.txt: https://www.dropbox.com/s/9hd0e3a8rt95pao/hprdtotal.txt

Hope this helps.

Edward Coelho
  • I recommend, first read inlist and store it in memory(e.g. a dictionary), and then open and read infile and do what you want. – hamed Jul 16 '12 at 16:48
  • As more of a thought than an answer, why not do your `for line1 in inlist` and then have a separate variable, say `line2` and get the next line from the `infile` each time the loop runs? – Sephallia Jul 16 '12 at 16:48
  • @hamed The problem with that is that I can't replace the chunks of text at will. – Edward Coelho Jul 16 '12 at 16:52
  • @Sephallia I tried that. Unfortunately one file ends earlier than the other one, so I can't really go that way, as it gives me an 'out of range error'. – Edward Coelho Jul 16 '12 at 16:53
  • @EdwardCoelho Hmm, you can do a try-catch block inside of the for-loop. Then, when you catch an exception, you can reset the shorter file to the start. In this situation, you would likely want to have the for loop control the longer loop. – Sephallia Jul 16 '12 at 16:55
  • I tried, but at the end the if statements didn't work. Any ideas? – Edward Coelho Jul 16 '12 at 18:36
  • as the order of the lines are important, are you sure you are comparing them as you want ? there may be an offset – jlengrand Jul 16 '12 at 18:42
  • @jlengrand Yup, I believe I am doing it in the most logical/right way. – Edward Coelho Jul 16 '12 at 19:24

5 Answers

1

Wouldn't something like this simply work?

(following your snippet)

for line in infile:  # read file 1 one line after the other
    line2 = inlist.readline()  # read a line of file 2
    if not line2:  # readline() returns '' at end of file instead of raising
        print("End of file 2 reached")
        inlist.seek(0)  # restart file 2 from the top
        line2 = inlist.readline()
    # Strip the line endings so the last column compares cleanly
    inlistcolumns = line2.rstrip('\r\n').split('\t')
    infilecolumns = line.rstrip('\r\n').split('\t')
    if inlistcolumns[0] == infilecolumns[2]:
        infilecolumns[2] = inlistcolumns[1]
    elif inlistcolumns[0] == infilecolumns[5]:
        infilecolumns[5] = inlistcolumns[1]
    outfile.write('\t'.join(infilecolumns) + '\n')

I really don't get why you wouldn't load one file into memory first, though, and then do a simple pattern search. Is there a proper reason for you to read both files at the same time? (Does line 45 of file 1 match line 45 of file 2?)

jlengrand
  • lines in files are not related to each other respectively. so this would not work! – hamed Jul 16 '12 at 16:59
  • That's what I don't get . Why not a dictionary then ? – jlengrand Jul 16 '12 at 17:00
  • I tried something like that, but it's what I already told Sephallia: one file ends earlier than the other and returns an indexerror (out of range). Although her (or his) last comment is a great idea. Thanks for your feedback anyway! (Btw, I wasn't the one downvoting you) – Edward Coelho Jul 16 '12 at 17:00
  • I said that before in comments, take a look at comments! – hamed Jul 16 '12 at 17:01
  • I did, but he didn't answer yet. Well not in a clear way at least – jlengrand Jul 16 '12 at 17:02
  • The objective in reading line by line is just that there isn't a pattern in the file. That's why I need the second file to be a 'dictionary'. Anyway, this works almost as I want. Just a little more changes and it is perfect. You gave me the knowledge I was missing to work around the issue. Thanks a lot! – Edward Coelho Jul 16 '12 at 17:12
  • Well, happy to help you; I was working on a version with dicts ^^ – jlengrand Jul 16 '12 at 17:14
  • Well, seems I need a bit more help after all ^^ – Edward Coelho Jul 16 '12 at 17:32
  • @jlengrand, have any more ideas? – Edward Coelho Jul 16 '12 at 18:55
  • @Edward I could have some, but I don't exactly understand what you want to do so it is difficult to answer :s – jlengrand Jul 16 '12 at 19:52
  • I'd need some more explanation about your problem, because even the answer I gave you does not sound logical for me :s. – jlengrand Jul 16 '12 at 19:58
  • @jlengrand do you have any ideas? I'm still working hard on this one and can't find the solution. – Edward Coelho Jul 17 '12 at 12:45
  • Didn't forget you, but I also have a job :). Will spend some time on it tonight – jlengrand Jul 17 '12 at 14:00
  • Ok, I've been working with dictionaries: data = {} for line in inlist: k, v = [x.strip('\n') for x in line.lower().split('\t')] data[k] = v But still have problems xD – Edward Coelho Jul 17 '12 at 14:57
1

What you're going to need to do is first read in the inlist file into memory, so that it is available for checking.

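initems = []
for line in inlist:
    split = line.split()
    if len(split) < 2:
        continue  # skip blank or malformed lines
    initems.append((split[0], split[1]))  # (1st column ID, 2nd column ID)
firstItems = dict(initems)  # maps 1st column ID -> 2nd column ID
secondItems = set(x[1] for x in initems)  # IDs that need no replacement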

That will give you data to hit against. Then open up your infile and read through it, checking against your data.

for line in infile:
    # Strip the line ending first, otherwise split[5] keeps its '\n' and never matches
    split = line.rstrip('\r\n').split('\t')
    if split[2] in firstItems:  # dict membership test, no need for .keys()
        split[2] = firstItems[split[2]] # proper field position
    if split[5] in firstItems:
        split[5] = firstItems[split[5]] # proper field position
    outfile.write('\t'.join(split)+'\n')
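
One detail worth checking when a lookup on the last column never matches: splitting a raw line on tabs leaves the line ending attached to the final field. The snippet below is only a small illustration of that pitfall; the sample line comes from the question, while `first_items` and the value `NEW_ID` are made up for the demonstration.

```python
# Sample line as it arrives when iterating over a file (note the trailing newline)
line = "WDR77\t14387\tNP_077007\tSNRPE\t00548\tNP_003085\n"

raw_fields = line.split('\t')                    # newline stays on the last field
clean_fields = line.rstrip('\r\n').split('\t')   # line ending removed first

# Hypothetical lookup table; "NEW_ID" is a placeholder, not a real identifier
first_items = {"NP_003085": "NEW_ID"}

print(repr(raw_fields[5]))               # 'NP_003085\n'
print(raw_fields[5] in first_items)      # False, the key has no newline
print(clean_fields[5] in first_items)    # True
```
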
Spencer Rathbun
  • I got your idea, but won't this use lots of memory? – Edward Coelho Jul 16 '12 at 17:24
  • @EdwardCoelho How large is your input file? This will use about the same memory as the file, the lookups will run quickly, and once the program quits it gets deleted from memory. – Spencer Rathbun Jul 16 '12 at 17:51
  • the file is not too large, but after making the editions you suggest it does the same as my edit above does taking considerably more time. I'm not saying it is a bad approach, because it totally makes sense and I really get it, but looks like the problem is in the if-statements. And that should be an error I made somewhere. – Edward Coelho Jul 16 '12 at 18:54
  • @EdwardCoelho Whoops! I looked over your original question again, and it seems I missed the `-` surrounding the keys. Comparisons are exact, so it will never find anything in the first if. I've updated my answer with the relevant fix. Hmm, nope that's not right either. – Spencer Rathbun Jul 16 '12 at 19:37
  • Don't worry with the surround '-'. Those were just to emphasize the IDs I didn't want changed. Maybe I'll just take them out, they are causing general confusion. – Edward Coelho Jul 16 '12 at 19:45
  • @EdwardCoelho Ok, I think this edit is much better. You want to check *both* fields and adjust them if necessary. Then write out your line. You may need to adjust for the dashes. – Spencer Rathbun Jul 16 '12 at 19:45
1

I would suggest loading inlist into memory as a lookup table (a dict in Python), then looping over infile and using the lookup table to decide whether to replace each value.

I'm not 100% sure I've got your logic correct here, but it's a base you can build on.

import csv

lookup = {}
uniq2nd = set()
with open('inlist') as f:
    tabin = csv.reader(f, delimiter='\t')
    for c1, c2 in tabin:
        lookup[c1] = c2
        uniq2nd.add(c2)

# 'wb' is the Python 2 mode for csv; on Python 3 use open('outfile', 'w', newline='')
with open('infile') as f, open('outfile', 'wb') as fout:
    tabin = csv.reader(f, delimiter='\t')
    tabout = csv.writer(fout, delimiter='\t')
    for row in tabin:
        if row[2] not in uniq2nd: # do nothing if col2 of inlist
            row[2] = lookup.get(row[2], row[2]) # replace or keep same
        # etc...
        tabout.writerow(row)
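
To make the "replace or keep same" step concrete, here is a minimal sketch of how `dict.get` with a default behaves together with the set of second-column IDs. The helper name `translate` is hypothetical, and the two lookup entries are copied from the sample inlist in the question.

```python
# Lookup data taken from the question's sample inlist
lookup = {"NP_077007": "Q9BQA1", "NP_858045": "Q9Y6Q9"}
uniq2nd = set(lookup.values())

def translate(identifier):
    """Return the replacement for identifier, or identifier itself if no rule applies."""
    if identifier in uniq2nd:
        # Already an ID from the 2nd column of inlist: do nothing
        return identifier
    # dict.get returns the default (here the identifier itself) when the key is absent
    return lookup.get(identifier, identifier)

print(translate("NP_077007"))  # Q9BQA1
print(translate("Q9BQA1"))     # Q9BQA1
print(translate("NP_665906"))  # NP_665906
```
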
Jon Clements
  • I never worked with the csv module, and as I am new to programming this is kinda 'chinese' for me xD – Edward Coelho Jul 16 '12 at 17:26
  • @EdwardCoelho It's just a smarter way of handling delimited files, whose field delimiter may be inside string delimiters (CSV format for instance). Very easy to use, and although a `.split('\t')` is reasonable for tab-delimited files, for CSV files it will just cause aggro... It's worth having a look at :) – Jon Clements Jul 16 '12 at 17:28
  • Cool, I'll try it when I feel confident ;D Thanks for the hint! – Edward Coelho Jul 16 '12 at 17:33
1
#!/usr/bin/python

inFile = open("file1.txt")
inList = open("file2.txt")
oFile = open("output.txt", "w")

entries = []      # rows of inFile, kept in their original order
dictionary = {}   # maps 1st column of inList -> 2nd column of inList

# Reads the rows of inFile (a list, so rows sharing a first column are not lost)
for line in inFile:
    entries.append([element.strip() for element in line.split('\t')])

# Creates the dict for inList
for line in inList:
    lineData = line.split('\t')
    dictionary[lineData[0].strip()] = lineData[1].strip()


# Applies transformation to the 3rd and 6th columns (indices 2 and 5) of inFile
for item in entries:
    for col in (2, 5):
        if item[col].startswith("-"):
            item[col] = item[col][1:-1]  # drop the surrounding dashes
        # Replace the ID if it is a key of the dictionary, otherwise keep it
        item[col] = dictionary.get(item[col], item[col])

# Writes the output file
for item in entries:
    oFile.write('\t'.join(item) + '\n')

inFile.close()
inList.close()
oFile.close()

As a note, make sure to format your inFile and inList appropriately with the correct delimiter. In this case I used the tab character (\t) to split the lines.
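
A quick way to check the delimiter before running the full script is to count the fields each splitting strategy produces on one line. This is only a sketch: `count_fields` is a hypothetical helper, and the two sample lines are the same row from the question written once with tabs and once with runs of spaces.

```python
def count_fields(line):
    """Return (field count splitting on tabs, field count splitting on any whitespace)."""
    stripped = line.rstrip('\r\n')
    return len(stripped.split('\t')), len(stripped.split())

tab_line = "ITCH\t07565\tQ96J02\tDTX1\t03991\tNP_004407\n"
space_line = "ITCH    07565   Q96J02  DTX1    03991   NP_004407\n"

print(count_fields(tab_line))    # (6, 6)
print(count_fields(space_line))  # (1, 6)
```

If the first number is 1 while the second is 6, the file is separated by spaces rather than tabs, and `split('\t')` would make every index above 0 fail.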

wtfomgjohnny
0

Ok, I found it out. This is what I did:

data = {}
for line in inlist:
    k, v = [x.strip() for x in line.split('\t')]
    data[k] = v

for line in infile:
    infilecolumns = line.strip().split('\t')

    value1 = data.get(infilecolumns[2])
    value2 = data.get(infilecolumns[5])

    if value1:
        infilecolumns[2] = value1
    if value2:
        infilecolumns[5] = value2

    outfile.write('\t'.join(infilecolumns) + '\n')

This gives the desired output nice and easy. Thanks for all your answers, helped me a lot!
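
For anyone who wants to try the same idea without creating files, here is a self-contained sketch that runs on in-memory data. The function name `replace_ids`, its `columns` parameter and the use of `io.StringIO` are assumptions made for the illustration; it also splits on any whitespace, which assumes the IDs themselves contain no spaces. The sample rows are taken from the question.

```python
import io

def replace_ids(inlist, infile, outfile, columns=(2, 5)):
    """Copy infile to outfile, replacing IDs in the given columns using inlist."""
    # Build the lookup table: 1st column of inlist -> 2nd column of inlist
    data = {}
    for line in inlist:
        parts = line.split()
        if len(parts) == 2:  # ignore blank or malformed lines
            data[parts[0]] = parts[1]

    for line in infile:
        fields = line.split()
        for col in columns:
            if col < len(fields):
                # Replace when the ID is a known key, otherwise keep it unchanged
                fields[col] = data.get(fields[col], fields[col])
        outfile.write('\t'.join(fields) + '\n')

# File-like objects standing in for the real files
sample_inlist = io.StringIO("NP_060607\tQ96LA8\nNP_077007\tQ9BQA1\n")
sample_infile = io.StringIO(
    "PRMT6\t10505\tQ96LA8\tHMGA1\t02829\tNP_665906\n"
    "WDR77\t14387\tNP_077007\tSNRPE\t00548\tNP_003085\n"
)
result = io.StringIO()
replace_ids(sample_inlist, sample_infile, result)
print(result.getvalue())
```

In this sample only NP_077007 in the second row is a key of the lookup table, so it is the only value that changes (to Q9BQA1); Q96LA8 is left alone because it never appears in the first column of the inlist.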

Edward Coelho