
I have a folder named cstruct with 20,000 .rsa files. From each file I need to extract every row that contains a CYS value and write those rows to a new file. Is there a way in Python to loop through the files in this folder and extract this information?

RES SER A 102 17.74 15.2 17.22 22.0 0.52 1.4 11.89 24.5 5.85 8.6
RES HIS A 103 17.32 9.5 16.53 11.2 0.78 2.2 12.22 12.6 5.10 5.9
RES CYS A 104 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0
RES LEU A 105 8.67 4.9 8.67 6.1 0.00 0.0 8.67 6.1 0.00 0.0
RES LEU A 106 5.72 3.2 5.72 4.1 0.00 0.0 5.72 4.0 0.00 0.0

  • what have you tried? where did you run into problems? https://stackoverflow.com/help/how-to-ask – hiro protagonist Aug 23 '15 at 18:53
  • This might be easier to do with `grep`, depending on your other requirements... – MikeTwo Aug 23 '15 at 19:01
  • or have a look at https://docs.python.org/3/library/glob.html#module-glob for finding `*.rsa` files and https://docs.python.org/3/library/re.html?highlight=re#module-re for extracting the data you want. – hiro protagonist Aug 23 '15 at 19:12

2 Answers


Something like the following Python script should get you going in the right direction:

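import glob
import os
import re

# A CYS record runs from "RES CYS" to the next " RES" or the end of the line.
# `$` (unlike `\Z`) also matches just before a trailing newline.
cys_record = re.compile(r"(RES CYS\s+\w+.*?)(?= RES|$)")

with open("output.txt", "w") as f_output:
    # os.path.join keeps the path portable across Windows and Unix.
    for rsa_file in glob.glob(os.path.join("cstruct", "*.rsa")):
        with open(rsa_file, "r") as f_input:
            # Write the file name as a header before its CYS rows.
            f_output.write(rsa_file + "\n")
            for row in f_input:
                for cys in cys_record.findall(row):
                    f_output.write(cys + "\n")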
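
To see what the pattern picks out, here is a small check against part of the sample data from the question, treated as one long line. The names `CYS_RECORD` and `sample` are made up for this illustration, and the end-of-line anchor is written as `$` here so that a trailing newline does not block the match.

```python
import re

# Hypothetical name for the pattern: one record runs from "RES CYS" up to the
# next " RES" or the end of the line.
CYS_RECORD = re.compile(r"(RES CYS\s+\w+.*?)(?= RES|$)")

# Three records from the question, joined into a single line of text.
sample = (
    "RES HIS A 103 17.32 9.5 16.53 11.2 0.78 2.2 12.22 12.6 5.10 5.9 "
    "RES CYS A 104 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 "
    "RES LEU A 105 8.67 4.9 8.67 6.1 0.00 0.0 8.67 6.1 0.00 0.0\n"
)

# findall returns the text captured by the single group for each match.
matches = CYS_RECORD.findall(sample)
print(matches)
```

Only the CYS record should come back, without the neighbouring HIS and LEU records.
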
Martin Evans

When you open a file with the built-in open() function and loop over the resulting file object, Python gives you one line of the file per iteration:

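import os

dirName = "C:\\Wherever\\Your\\Files\\Are"
for rsafile in os.listdir(dirName):
    if not rsafile.endswith(".rsa"):
        continue  # skip anything that is not an .rsa file
    filepath = os.path.join(dirName, rsafile)
    with open(filepath, "r") as f:
        for line in f:  # iterating a file object yields one line at a time
            if "CYS" in line:
                print(line, end="")  # line already ends with a newline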

Depending on how your "rows" are defined, you might need to pull the relevant CYS substring out of each line after you identify the relevant lines.
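
If a line can hold several records, one way to pull out just the CYS ones is to walk the whitespace-separated tokens and start a new record at each `RES` marker. This is only a sketch: it assumes every record begins with the token `RES` followed by the residue name, as in the sample in the question, and `cys_records` is a made-up helper name.

```python
def cys_records(line):
    """Return the 'RES CYS ...' records found in one line of text."""
    records = []
    current = []
    for token in line.split():
        if token == "RES":
            # A new record starts here; keep the one collected so far.
            if current:
                records.append(current)
            current = [token]
        elif current:
            current.append(token)
    if current:
        records.append(current)
    # The second token of a record is the residue name.
    return [" ".join(r) for r in records if len(r) > 1 and r[1] == "CYS"]


line = "RES HIS A 103 17.32 RES CYS A 104 0.00 0.0 RES LEU A 105 8.67\n"
print(cys_records(line))
```

Because it compares whole tokens, this does not depend on how many columns each record has.
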

Just for fun, I compared the speed of this method (if "pattern" in line) with that of a regex approach, re.search(".*CYS.*", line).
For small files, on my laptop, the Python "in" operator was ~91x faster on average (100 iterations).
Regex re.search run time: 0.093 seconds.
"in" operator run time: 0.001 seconds.
This was timed with the timeit module. Both timings include the same file open/close overhead, so the difference between them comes from the matching method.
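
For anyone who wants to repeat this kind of comparison, a rough sketch with timeit could look like the following. It is not the script used for the numbers above: the data is a made-up in-memory list rather than real files, the function names are hypothetical, and the timings will vary by machine.

```python
import re
import timeit

# Made-up in-memory data standing in for the lines of one small file.
lines = [
    "RES SER A 102 17.74 15.2\n",
    "RES CYS A 104 0.00 0.0\n",
    "RES LEU A 105 8.67 4.9\n",
] * 200


def with_in():
    # Plain substring test.
    return [line for line in lines if "CYS" in line]


def with_regex():
    # Same regex as in the comparison described above.
    return [line for line in lines if re.search(".*CYS.*", line)]


# Run each function 20 times and report the total time for each.
t_in = timeit.timeit(with_in, number=20)
t_re = timeit.timeit(with_regex, number=20)
print(f'"in" operator: {t_in:.4f} s, re.search: {t_re:.4f} s')
```

Both functions select the same lines, so only the matching method differs between the two timings.
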

  • the code works nicely, but I need it to print all output to a .txt file instead of console – biochem623 Aug 24 '15 at 14:40
  • See [this answer](http://stackoverflow.com/questions/6159900/correct-way-to-write-line-to-file-in-python/6160082#6160082) under "Correct way to write line to file in Python", with links to documentation on `open()`. Use the [file object](https://docs.python.org/3.3/glossary.html#term-file-object)'s write method: `o = open('outputFile.txt','a'); o.write(line)` You might have to tack on extra newlines, like: `o.write(line+'\n')` – saschP4-16 Aug 26 '15 at 01:57
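
Putting the comments above together with the loop from the answer, a minimal sketch that writes the matching lines to a text file instead of the console might look like this. The function name `write_cys_lines` and its parameters are made up for illustration, and it assumes every line containing "CYS" should be copied as-is.

```python
import os


def write_cys_lines(dir_name, out_path):
    """Copy every line containing "CYS" from the .rsa files in dir_name to out_path.

    Returns the number of lines written.
    """
    count = 0
    with open(out_path, "w") as out:
        for name in sorted(os.listdir(dir_name)):
            if not name.endswith(".rsa"):
                continue  # skip files that are not .rsa
            with open(os.path.join(dir_name, name), "r") as f:
                for line in f:
                    if "CYS" in line:
                        # Lines normally keep their newline; add one only if
                        # the last line of a file is missing it.
                        out.write(line if line.endswith("\n") else line + "\n")
                        count += 1
    return count
```

It could then be called as `write_cys_lines("cstruct", "cys_rows.txt")`, where both names are placeholders for your own folder and output file.
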