
I wrote a small script in Python to concatenate some lines from different files into one file. But somehow it doesn't print anything I expect it to print from the function I wrote. I tried to spot the problem, but after one evening and one morning I still can't find it. Could somebody help me, please? Thanks a lot!

So I have a folder containing around a thousand .fa files. From each .fa file, I would like to extract the lines starting with ">" and apply some changes to pull out the information I want. In the end, I would like to combine all the information extracted from one file into one line of a new file, and then concatenate the information from all the .fa files into one .txt file.

So the folder:

% ls
EstimatedSpeciesTree.nwk    HOG9998.fa          concatenate_gene_list_HOG.py
HOG9997.fa          HOG9999.fa          output

One .fa file for example

>BnaCnng71140D [BRANA]
MTSSFKLSDLEEVTTNAEKIQNDLLKEILTLNAKTEYLRQFLHGSSDKTFFKKHVPVVSYEDMKPYIERVADGEPSEIIS
GGPITKFLRRYSF

>Cadbaweo98702.t [THATH]
MTSSFKLSDLEEVTTNAEKIQNDLLKEILTLNAKTEYLRQFLHGSSDKTFFKKHVPVVSYEDMKPYIERVADGEPSEIIS
GGPITKFLRRYSF

What I would like to have is one file like this

HOG9997.fa BnaCnng71140D:BRANA Cadbaweo98702.t:THATH
HOG9998.fa Bkjfnglks098:BSFFE dsgdrgg09769.t
HOG9999.fa Dsdfdfgs1937:XDGBG Cadbaweo23425.t:THATH Dkkjewe098.t:THUGB
# NOTE: the number of lines in each .fa file is uncertain. Also, some lines have [ ], but some lines do not.

So my code is

#!/usr/bin/env python3

import os
import re
import csv


def concat_hogs(a_file):
    output = []
    for row in a_file:  # select the gene names in each HOG fasta file
        if row.startswith(">"):
            trans = row.partition(" ")[0].partition(">")[2]
            if row.partition(" ")[2].startswith("["):
                species = re.search(r"\[(.*?)\]", row).group(1)
            else:
                species = ""
            output.append(trans + ":" + species)

    return '\t'.join(output)



folder = "/tmp/Fasta/"
print("Concatenate names in " + folder)

for file in os.listdir(folder):
    if file.endswith('.fa'):
        reader = csv.reader(file, delimiter="\t")
        print(file + concat_hogs(reader))

But the output only prints the file name without the part that should be generated by the function concat_hogs(file). I don't understand why.

zzz
  • Just to add a comment about my mistake in this code, in case somebody else makes the same mistake: the problem with my code is, as Daniil pointed out below, that there is no proper opening of the input file. The solution is to add a line `with open(a_file, 'r') as file:` after the `output = []` line and take out the `reader = csv.reader(file, delimiter="\t")` line; then this code works. However, it only works when the script is in the same folder as the input files. So if the code and the input are not in the same folder, better to try the solution below using `pathlib`. – zzz Oct 06 '22 at 11:38

2 Answers


Use standard Python libraries

In this case

  • regex (use a site such as regex101 to test your regex)
  • pathlib to encapsulate paths in a platform independent way
  • collections.namedtuple to make data more structured

A breakdown of the regex used here: >([a-z0-9A-Z\.]+?)\s*(\n|\[([A-Z]+)\]?\n)

  • > The start of block character
  • (regex1) First matching block
  • \s* Any amount of whitespace (i.e. zero whitespace is ok)
  • (regex2|regex3) A choice of two possible regexes
  • regex1: [a-z0-9A-Z\.]+? = one or more characters from the class: a to z, A to Z, 0 to 9, or a dot
  • regex2: \n = a newline that immediately follows the whitespace
  • regex3: \[([A-Z]+)\] = one or more upper-case letters inside square brackets

Note1: The parentheses create capture groups, which we later use to split out the fields.
Note2: The regex allows zero or more whitespace characters between the first and second part of the header line, which makes it more resilient.
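A quick way to sanity-check the pattern is to run it against the two header styles from the question, one with and one without a species tag (the sample string below is made up to match the question's data):

```python
import re

# The pattern from the breakdown above
pattern = re.compile(r">([a-z0-9A-Z\.]+?)\s*(\n|\[([A-Z]+)\]?\n)")

# Two header styles: with and without a [SPECIES] tag
sample = ">BnaCnng71140D [BRANA]\nMTSSFKLSDL\n>Cadbaweo98702.t\nGGPITKFLRR\n"

matches = pattern.findall(sample)
for m in matches:
    # m[0] is capture group 1 (the name), m[-1] is group 3 (the species, or '')
    print(f"{m[0]}:{m[-1]}")

# prints:
# BnaCnng71140D:BRANA
# Cadbaweo98702.t:
```

Note that group 3 comes back as an empty string when the bracketed alternative did not match, which is what makes the later `key:text` formatting uniform.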

import re
from collections import namedtuple
from pathlib import Path
import os

class HOG(namedtuple('HOG', ['filepath', 'filename', 'key', 'text'], defaults=[None])):
    __slots__ = ()
    def __str__(self):
        return f"{self.key}:{self.text}"
    
regex = re.compile(r">([a-z0-9A-Z\.]+?)\s*(\n|\[([A-Z]+)\]?\n)")

path = Path(os.path.abspath("."))
wildcard = "*.fa"
files = list(path.glob(wildcard))

print(f"Searching {path}/{wildcard} => found {len(files)} files")

data = {}

for file in files:
    print(f"Processing {file}")
    with file.open() as hf:
        text = hf.read(-1)
        
    matches = regex.findall(text)
    
    for match in matches:
        key = match[0].strip()
        text = match[-1].strip()
        if file.name not in data:
            data[file.name] = []
            
        data[file.name].append(HOG(path, file.name, key, text))

print("Now you have the data you can process it as you like")
for file, entries in data.items():
    line = "\t".join(list(str(e) for e in entries))
    print(file, line)

# e.g. Write the output as desired
with Path("output").open("w") as fh:
    for file, entries in data.items():
        line = "\t".join(list(str(e) for e in entries))
        fh.write(f"{file}\t{line}\n") 
Jay M
  • I like the idea of immediately applying the regular expression to the entire file with `findall` as opposed to individual lines. It's probably faster that way, even though you still iterate over the results. The line-by-line approach still has the benefit of not needing to read the entire file into memory. Though I am guessing in this case this will not be an issue. – Daniil Fajnberg Oct 04 '22 at 13:56
  • Yep, for extremely large files reading it all into RAM might not be the best way. 32-bit linux/python has a 2G file limit, so smaller than most RAM+swap on modern machines. 64-bit linux/python is (I think) 512TB? Assuming a very large swap, that could just theoretically exist on a high-end PC, but not on the average PC system. To handle really large files you'd have to use pages with open(buffering=xxx) and keep track of the file pointer in code rather than let the OS do it for you. – Jay M Oct 04 '22 at 15:41
  • Thank you so much for explaining all the details! I learned so much from your explanation! – zzz Oct 04 '22 at 22:10
  • Well, that's exactly what Stack Overflow is for. It's not supposed to be 'fix it for me', more 'help me understand how to fix it, and build knowledge'. – Jay M Oct 05 '22 at 08:25

The error comes from you passing the name of the file to your concat_hogs function instead of an iterable file handle. You are missing the actual opening of the file for reading purposes.
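For completeness, here is a sketch of the smallest change to the original script: keep the function as-is conceptually, join the folder path onto the file name, and open the file before passing it in (I have also added a `.strip()` so a trailing newline does not end up in the name, and simplified the species lookup; the demo data is made up to match the question):

```python
#!/usr/bin/env python3
import os
import re


def concat_hogs(lines):
    """Collect 'name:species' entries from FASTA header lines."""
    output = []
    for row in lines:  # works for any iterable of lines, e.g. an open file
        if row.startswith(">"):
            # strip() so a trailing newline does not end up in the name
            trans = row.partition(" ")[0].partition(">")[2].strip()
            bracket = re.search(r"\[(.*?)\]", row)
            species = bracket.group(1) if bracket else ""
            output.append(trans + ":" + species)
    return "\t".join(output)


sample = [">BnaCnng71140D [BRANA]\n", "MTSSFKLSDL\n", ">Cadbaweo98702.t\n"]
print(concat_hogs(sample))  # -> "BnaCnng71140D:BRANA\tCadbaweo98702.t:"

folder = "/tmp/Fasta/"
if os.path.isdir(folder):  # guarded so the sketch runs even without the data
    for name in os.listdir(folder):
        if name.endswith(".fa"):
            # the crucial missing piece: actually open the file
            with open(os.path.join(folder, name)) as fh:
                print(name, concat_hogs(fh))
```

The `os.path.join` also fixes the related issue that `os.listdir` returns bare file names, so the original would only have worked when run from inside the data folder.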


I agree with Jay M that your code can be simplified drastically, not least by using regular expressions more efficiently. Also pathlib is awesome.

But I think it can be even more concise and expressive. Here is my suggestion:

#!/usr/bin/env python3
import re
from pathlib import Path


GENE_PATTERN = re.compile(
    r"^>(?P<trans>[\w.]+)\s+(?:\[(?P<species>\w+)])?"
)


def extract_gene(string: str) -> str:
    match = re.search(GENE_PATTERN, string)
    return ":".join(match.groups(default=""))


def concat_hogs(file_path: Path) -> str:
    with file_path.open("r") as file:
        return '\t'.join(
            extract_gene(row)
            for row in file
            if row.startswith(">")
        )


def main() -> None:
    folder = Path("/tmp/Fasta/")
    print("Concatenate names in", folder)
    for element in folder.iterdir():
        if element.is_file() and element.suffix == ".fa":
            print(element.name, concat_hogs(element))


if __name__ == '__main__':
    main()

I am using named capturing groups for the regular expression because I prefer it for readability and usability later on.

Also I assume that the first group can only contain letters, digits and dots. Adjust the pattern, if there are more options.

PS

Just to add a few additional explanations:

The pathlib module is a great tool for any basic filesystem-related task. Among a few other useful methods you can look up there, I use the Path.iterdir method, which just iterates over elements in that directory instead of creating an entire list of them in memory first the way os.listdir does.
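A minimal illustration of that (the directory path is just the example from the question, so the loop is guarded in case it does not exist):

```python
from pathlib import Path

folder = Path("/tmp/Fasta")  # the example directory from the question
if folder.is_dir():  # guarded so the snippet runs even without the data
    for element in folder.iterdir():  # yields entries lazily, one at a time
        if element.is_file() and element.suffix == ".fa":
            print(element.name)

# Path objects also make simple name handling explicit:
print(Path("/tmp/Fasta/HOG9997.fa").suffix)  # .fa
print(Path("/tmp/Fasta/HOG9997.fa").stem)    # HOG9997
```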

The RegEx Match.groups method returns a tuple of the matched groups, the default parameter allows setting the value when a group was not matched. I put an empty string there, so that I can always simply str.join the groups, even if the species-group was not found. Note that this .groups call will result in an AttributeError, if no match was found because then the match variable will be set to None. It may or may not be useful for you to catch this error.
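To make that concrete, here is how the two header variants from the question behave with `groups` and its `default` parameter:

```python
import re

GENE_PATTERN = re.compile(r"^>(?P<trans>[\w.]+)\s+(?:\[(?P<species>\w+)])?")

m1 = GENE_PATTERN.search(">BnaCnng71140D [BRANA]\n")
m2 = GENE_PATTERN.search(">Cadbaweo98702.t\n")

print(m1.groups())            # ('BnaCnng71140D', 'BRANA')
print(m2.groups())            # ('Cadbaweo98702.t', None) -- species not matched
print(m2.groups(default=""))  # ('Cadbaweo98702.t', '')

# With the default, str.join never sees None:
print(":".join(m2.groups(default="")))  # Cadbaweo98702.t:
```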

For a few additional pointers about using regular expressions in Python, there is a great How-To-Guide in the docs. In addition I can only agree with Jay M about how useful regex101.com is, regardless of language specifics. Also, I think I would recommend using his approach of reading the entire file into memory as a single string first and then using re.findall on it to grab all matches at once. That is probably much more efficient than going line-by-line, unless you are dealing with gigantic files.

In concat_hogs I pass a generator expression to str.join. This is more efficient than first creating a list and passing that to join because no additional memory needs to be allocated. This is possible because str.join accepts any iterable of strings and that generator expression (... for ... in ...) returns a Generator, which inherits from Iterator and thus from Iterable. For more insight about the container inheritance structures I always refer to the collections.abc docs.
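As a small illustration of the difference (sample lines made up), both forms produce the same string; the generator version just hands items to `str.join` as they are produced instead of building the whole list first:

```python
rows = [">geneA [AAA]\n", "MTSSF\n", ">geneB [BBB]\n"]  # made-up sample lines

# List comprehension: builds the full intermediate list in memory first
joined_list = "\t".join([row.strip() for row in rows if row.startswith(">")])

# Generator expression: str.join consumes items as they are produced
joined_gen = "\t".join(row.strip() for row in rows if row.startswith(">"))

assert joined_list == joined_gen == ">geneA [AAA]\t>geneB [BBB]"
```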

Daniil Fajnberg
  • Great suggestion on the use of named groups – Jay M Oct 04 '22 at 13:33
  • Thank you so much for pointing out my error! I didn't notice that the csv.reader would need an additional open command. I'm still at the very beginning of learning Python, and thank you for giving me a more elegant script! I haven't learned about things like `__name__` and `(string: str) ->` yet. I will go and google a bit. Thanks again! – zzz Oct 04 '22 at 22:09
    @zzz Glad I could help. [Here](https://stackoverflow.com/questions/419163) is the most popular Python question about `__name__`. And [here](https://stackoverflow.com/questions/32557920) is the one about type hints/annotations. I'll add a few other brief notes to my answer. – Daniil Fajnberg Oct 05 '22 at 08:21