I have two files, similar to the ones below:

File 1 - phenotype information; the first column is the individual's ID. The original file has 400 rows:

215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745

File 2 - SNP information; the original file has 400 lines and 42,000 characters per line:

215          20211111201200125201212202220111202005111102
222          20111011212200025002211001111120211015112111
216          20210005201100025210212102210212201005101001
223          20222120201200125202202102210121201005010101
217          20211010202200025201202102210121201005010101
218          02022000252012021022101212010050101012021101

I need to remove from file 2 the individuals that do not appear in file 1, so that the result looks like this:

215          20211111201200125201212202220111202005111102
222          20111011212200025002211001111120211015112111
216          20210005201100025210212102210212201005101001
223          20222120201200125202202102210121201005010101 

I was able to do this with the following command:

awk 'NR==FNR{a[$1]; next} $1 in a{print $0}' file1 file2 > file3

However, when I run my main analysis with the generated file, the following errors appear:

*** Error in `./airemlf90': free(): invalid size: 0x00007f5041cc2010 ***
*** Error in `./postGSf90': free(): invalid size: 0x00007fec4a04f010 ***

airemlf90 and postGSf90 are the programs I run. With the original file this problem does not occur. Is the command I used to remove individuals correct? One detail I did not mention: some individuals have 4-character IDs. Could that be the cause of the error?

Thanks

Greg Rov

1 Answer

I wrote a small Python script in a few minutes. I tested it with 42,000-character lines and it works fine.

import sys

# rudimentary argument parsing

file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]

present = set()

# first read file 1, discard all fields except the first one (the key)
with open(file1, "r") as f1:
    for l in f1:
        toks = l.split()    # split on runs of whitespace, same as awk fields
        if toks:   # robustness against empty lines
            present.add(toks[0])

# now read the second file and write a line to the third one only if its id is in the set

with open(file2, "r") as f2:
    with open(file3, "w") as f3:
        for l in f2:
            toks = l.split()
            if toks and toks[0] in present:
                f3.write(l)

(First install python if not already present.)

Save the sample script as mytool.py and run it like this:

python mytool.py file1.txt file2.txt file3.txt

To process several files at once, you can simply call the script from a bash file in place of the original awk command (this is not optimal, since everything could be done in a single Python run):

<whatever the for loop you need>; do
  python mytool.py "$1" "$2" "$3"
done

The script is called exactly as you would call awk with 3 files.
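The single-run variant mentioned above could look roughly like the sketch below. It is only an illustration under assumptions that are not part of the original script: the name mytool_multi.py, the argument order (SNP file first, then any number of phenotype files), and the ".snp" suffix for the output files are all hypothetical choices.

```python
import sys


def read_ids(path):
    """Collect the first whitespace-separated field of every non-blank line."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                ids.add(fields[0])
    return ids


def filter_by_ids(source_path, ids, dest_path):
    """Copy the lines of source_path whose first field is in ids.

    Returns the number of lines written to dest_path.
    """
    kept = 0
    with open(source_path) as source, open(dest_path, "w") as dest:
        for line in source:
            fields = line.split()
            if fields and fields[0] in ids:
                dest.write(line)  # write the line unchanged
                kept += 1
    return kept


def main(snp_path, pheno_paths):
    # One filtered SNP file per phenotype file.
    for pheno_path in pheno_paths:
        out_path = pheno_path + ".snp"  # assumed naming scheme
        kept = filter_by_ids(snp_path, read_ids(pheno_path), out_path)
        print("{}: kept {} lines -> {}".format(pheno_path, kept, out_path))


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("usage: python mytool_multi.py snp_file pheno_file [pheno_file ...]")
    else:
        main(sys.argv[1], sys.argv[2:])
```

With these assumptions, a call such as `python mytool_multi.py file2.txt pheno1.txt pheno2.txt` would write pheno1.txt.snp and pheno2.txt.snp, each containing only the SNP lines whose ID appears in the corresponding phenotype file. The printed count of kept lines gives a quick check that no individual was lost.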

Jean-François Fabre
  • It works perfectly, great! But I need to do this with 9 phenotype files and my main program is linked. Can't bash and Python be used together? Thanks for your help and support! – Greg Rov Jul 25 '16 at 15:45
  • You can of course use this Python script within a bash file if you feel comfortable with it (it could also be done with a single call to Python, but that would require more argument parsing). – Jean-François Fabre Jul 25 '16 at 15:49
  • I tried to do this. Thank you, it's a very interesting solution. =) – Greg Rov Jul 25 '16 at 15:55
  • Good. Sometimes it's best to keep an open mind about the tools used to solve your problem. Replies to Python questions are amazingly fast compared to awk, sed and other legacy stuff. – Jean-François Fabre Jul 26 '16 at 12:45