0

I'm working on a Python script to match the annotated GO IDs from a file, and than extract Level 2 Gene Ontology (GO) IDs from a hierarchy. However, I'm facing issues with extracting the correct Level 2 GO IDs and displaying the corresponding information. Here's what I have so far: I have used go-basic.obo file to classify terms for the annotated GO IDs and their frequencies (total number of go_id) in a file called file.txt. I Expected Output in excell file with following column

The GO ID itself. The corresponding term name from the GO hierarchy. The Level 2 GO ID in the hierarchy. The frequency of the GO ID from file.txt.

I tried with this code, but its not properly extracting go id at level 2 of corroespong annotated go id, here is the code import pandas as pd

Load your GO hierarchy from go-basic.obo

def parse_obo(file_path):
    terms = {}
    current_term = {}
    isterm = False
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            isterm = isterm or line.startswith('id:')
            if isterm and ": " in line:
                key, value = line.split(': ', 1)
                if key == "id":
                    current_term = terms.setdefault(value, {})
                    current_term["id"] = value
                else:
                    current_term.setdefault(key, []).append(value)
    return terms

def make_hierarchy(terms):
    for term in list(terms.values()):
        term.setdefault("children", [])
        if "is_a" in term:
            for is_a in term["is_a"]:
                parent = is_a.split()[0]
                terms.setdefault(parent, { 'id': parent }).setdefault("children", []).append(term)
    return [term for term in terms.values() if "is_a" not in term]

def calculate_go_frequency(file_path):
    go_frequency = {}
    with open(file_path, 'r') as f:
        for line in f:
            _, go_id = line.strip().split('\t')
            go_frequency[go_id] = go_frequency.get(go_id, 0) + 1
    return go_frequency

def find_level_2_go_id(go_id, terms, hierarchy):
    for term in hierarchy:
        if term["id"] == go_id:
            return None
        
        level_2_id = None
        for child in term.get("children", []):
            if child["id"] == go_id:
                level_2_id = term["id"]
                break
            for sub_child in child.get("children", []):
                if sub_child["id"] == go_id:
                    level_2_id = child["id"]
                    break
            if level_2_id:
                break
        
        if level_2_id:
            return level_2_id
        else:
            level_2_id = find_level_2_go_id(go_id, terms, term.get("children", []))
            if level_2_id:
                return level_2_id
    
    return None

if __name__ == "__main__":
    file_path = 'go-basic.obo'
    terms = parse_obo(file_path)
    roots = make_hierarchy(terms)
    
    # Read and process the file.txt file
    unique_file_path = 'file.txt'
    unique_go_frequency = calculate_go_frequency(unique_file_path)
    
    # Create a list of dictionaries for the data
    data = []
    for go_id, frequency in unique_go_frequency.items():
        term = terms.get(go_id, {})
        term_name = term.get("name", "unknown_term")
        
        # Get the Level 2 GO ID based on the hierarchy
        level_2_id = find_level_2_go_id(go_id, terms, roots)
        
        data.append({
            "GO ID": go_id,
            "Term": term_name,
            "Level 2 GO ID": level_2_id,
            "Frequency": frequency
        })
    
    # Create a DataFrame from the data
    df = pd.DataFrame(data)
    
    # Save the DataFrame to an Excel file
    output_excel = 'gene_counts_with_levels.xlsx'
    df.to_excel(output_excel, index=False)

    print(f"Data saved to {output_excel}")

kindly help this question is also following the previous question Retrieving Gene Ontology Hierarchy Using Python from this example file, level will be classify like this, given below

level1 GO:0048311
  level2 GO:0000001
level1 GO:0022411
  level2 GO:1903008
    level3 GO:0090166
level1 GO:0016043
  level2 GO:0006996
    level3 GO:0048308
      level4 GO:0000001
      level4 GO:0048309
      level4 GO:0048313
    level3 GO:0007029
      level4 GO:0048309
    level3 GO:0007030
      level4 GO:0048313
      level4 GO:0090166
    level3 GO:1903008
      level4 GO:0090166

this is my file

id  GO_ID
OGOAOAJO_00443  GO:0006996
OGOAOAJO_00021  GO:0048313
OGOAOAJO_00021  GO:0048308
OGOAOAJO_00830  GO:0048308
OGOAOAJO_00830  GO:0048308
OGOAOAJO_02897  GO:0000001
OGOAOAJO_02897  GO:0000001
OGOAOAJO_00517  GO:0000001
OGOAOAJO_01362  GO:0000001
OGOAOAJO_02172  GO:0048309
OGOAOAJO_02172  GO:0048313
OGOAOAJO_03684  GO:0007029
OGOAOAJO_03684  GO:0000166
OGOAOAJO_03684  GO:0048309
OGOAOAJO_03684  GO:0007030
OGOAOAJO_01125  GO:0048313
OGOAOAJO_01125  GO:0048313
OGOAOAJO_00657  GO:0090166
OGOAOAJO_00223  GO:0090166
OGOAOAJO_00223  GO:0090166
OGOAOAJO_03140  GO:1903008
OGOAOAJO_03140  GO:0090166
OGOAOAJO_03647  GO:0090166
OGOAOAJO_03407  GO:0090166
OGOAOAJO_00603  GO:0048311
OGOAOAJO_00603  GO:0048311
OGOAOAJO_01401  GO:0000001
OGOAOAJO_00603  GO:0000001

my output is like thsi

 GO ID  Term    Level 2 GO ID   Frequency
GO:0006996  ['organelle organization']  GO:0016043  1
GO:0048313  ['Golgi inheritance']   GO:0048308  4
GO:0048308  ['organelle inheritance']   GO:0006996  3
GO:0000001  ['mitochondrion inheritance']   GO:0048311  6
GO:0048309  ['endoplasmic reticulum inheritance']   GO:0048308  2
GO:0007029  ['endoplasmic reticulum organization']  GO:0006996  1
GO:0007030  ['Golgi organization']  GO:0006996  1
GO:0090166  ['Golgi disassembly']   GO:1903008  6
GO:1903008  ['organelle disassembly']   GO:0022411  1
GO:0048311  unknown_term        2

but problm is that in Level 2 GO ID go:id is not coming as aspected in level 2, as it can be seen that first row Level 2 GO ID should be GO:0006996 as a level2 but it is GO:0016043 which is level1,

correct output would be

GO ID  Term    Level 2 GO ID   Frequency
GO:0006996  ['organelle organization']  GO:0006996  1
GO:0048313  ['Golgi inheritance']   GO:0006996  4
GO:0048308  ['organelle inheritance']   GO:0006996  3
GO:0000001  ['mitochondrion inheritance']   GO:0000001  6
GO:0000001  ['mitochondrion inheritance']   GO:0006996  6
GO:0048309  ['endoplasmic reticulum inheritance']   GO:0006996  2
GO:0007029  ['endoplasmic reticulum organization']  GO:0006996  1
GO:0007030  ['Golgi organization']  GO:0006996  1
GO:0090166  ['Golgi disassembly']   GO:1903008  6
GO:0090166  ['Golgi disassembly']   GO:0006996  6
GO:1903008  ['organelle disassembly']   GO:1903008  1
GO:1903008  ['organelle disassembly']   GO:0006996  1
GO:0048311  unknown_term        2
Umar
  • 117
  • 7
  • Can you add what is in your unique.txt file? – trincot Aug 12 '23 at 18:58
  • What exactly is "frequency" here? The number of times a GO ID appears as `id:` value, or as `is_a` value, or both? What do you mean with "level 2". Can you include in your question the actual output you expect for the example go-basic.obo file? – trincot Aug 12 '23 at 19:01
  • @trincot thank you for replying, I edited, since data is huge, so I tried to expalin in a example data set – Umar Aug 12 '23 at 20:46
  • What would be the correct output that is expected? Is it only the first line that is wrong? Is it expected that the last line has an empty level 2 value? – trincot Aug 12 '23 at 21:05
  • Thank you @trincot , I edited, I hope its fine – Umar Aug 12 '23 at 21:24
  • @trincot, any update – Umar Aug 13 '23 at 06:43
  • @trincot, thank you its working, if i edit this question little bit, will u help me, just to see difference – Umar Aug 13 '23 at 16:32
  • I wouldn't edit the question if the intention is to get more help. Instead, first try to achieve your next goal yourself, and if you really cannot make it work, ask a new question targeted at where you got stuck in trying to achieve that next target. – trincot Aug 13 '23 at 16:37
  • sure I will try @Thank you – Umar Aug 13 '23 at 18:45
  • @trincot, I noticed that when I executed the code, I didn't observe any repetition in the results. For instance: GO:0000001 ['mitochondrion inheritance'] GO:0000001 6 GO:0000001 ['mitochondrion inheritance'] GO:0006996 6 However, I expected to see some repetition in the output. Specifically, the entry "GO:0000001 ['mitochondrion inheritance']" appears twice in the hierarchy, but it's printed only once. The desired behavior is to print it twice, each time with a different Level 2 GO ID. Could you please take a look and advise on how to achieve the desired output? Thank you. – Umar Aug 17 '23 at 12:45
  • OK, but that seems a different question. One problem at a time ;-) – trincot Aug 17 '23 at 12:50
  • @trincot, I am sorry but i didnt change question, its a same question, that time i didint see it poperly – Umar Aug 17 '23 at 12:51
  • You wrote: *"but problm is that in Level 2 GO ID go:id is not coming as aspected in level 2, as it can be seen that first row Level 2 GO ID should be GO:0006996 as a level2 but it is GO:0016043 which is level1"*. I thought *that* was your question. Not that it didn't output GO:0000001 only once. – trincot Aug 17 '23 at 12:54
  • Also, if you expect double lines then why the desired input does not include a second line for `GO:1903008` (with `GO:0006996` as second level)? – trincot Aug 17 '23 at 13:23
  • @trincot, "I apologize for any confusion. Thank you for pointing out the missing second line for GO:1903008 with GO:0006996 as the second level. I will make sure to include it in my desired results. Your feedback is much appreciated." – Umar Aug 17 '23 at 13:26
  • 1
    @trincot, i edited – Umar Aug 17 '23 at 13:37

1 Answers1

1

This piece of code:

        for child in term.get("children", []):
            if child["id"] == go_id:
                level_2_id = term["id"]
                break

...may also set level2_id when term is a top-level term.

I would suggest a different approach and write a function that returns the possible root-to-node paths to the given id, i.e. where each path is a list of IDs where the last one in the list is the searched id. This is a more generic function and can be used to find an ID at any level in the hierarchy.

So replace find_level_2_go_id with this function:

def find_paths_go_id(go_id, terms):
    def get_ancestors(id, path):
        if id not in terms or "is_a" not in terms[id]:
            yield path
            return
        for parent in terms[id]["is_a"]:
            yield from get_ancestors(parent, [parent, *path])
                
    return get_ancestors(go_id, [go_id])

Add another function to extract a certain level from such paths:

def get_distinct_at_depth(paths, depth):
    return { path[depth] if depth < len(path) else None for path in paths }

And call it like this:

    for go_id, frequency in unique_go_frequency.items():
        if not go_id.startswith("GO:"):
            continue
        term = terms.get(go_id, {})
        term_name = term.get("name", "unknown_term")
        
        # Get the Level 2 GO ID based on the hierarchy
        level_2_nodes = get_distinct_at_depth(find_paths_go_id(go_id, terms), 1)
        
        for level_2_id in level_2_nodes:
            data.append({
                "GO ID": go_id,
                "Term": term_name,
                "Level 2 GO ID": level_2_id,
                "Frequency": frequency
            })

In the existing parse_obo make sure to skip the part after "!" (like in the "is_a" properties):

                else:
                    # skip the part after "!"
                    current_term.setdefault(key, []).append(value.split(" !")[0])

All changes applied

...result in this code:

import pandas as pd

def parse_obo(file_path):
    terms = {}
    current_term = {}
    isterm = False
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            isterm = isterm or line.startswith('id:')
            if isterm and ": " in line:
                key, value = line.split(': ', 1)
                if key == "id":
                    current_term = terms.setdefault(value, {})
                    current_term["id"] = value.split()[0]  # only get id
                else:
                    # skip the part after "!"
                    current_term.setdefault(key, []).append(value.split(" !")[0])
    return terms

def make_hierarchy(terms):  # not used
    for term in list(terms.values()):
        term.setdefault("children", [])
        if "is_a" in term:
            for is_a in term["is_a"]:
                parent = is_a.split()[0]
                terms.setdefault(parent, { 'id': parent }).setdefault("children", []).append(term)
    return [term for term in terms.values() if "is_a" not in term]

def calculate_go_frequency(file_path):
    go_frequency = {}
    with open(file_path, 'r') as f:
        for line in f:
            _, go_id = line.strip().split('\t')
            go_frequency[go_id] = go_frequency.get(go_id, 0) + 1
    return go_frequency

def find_paths_go_id(go_id, terms):
    def get_ancestors(id, path):
        if id not in terms or "is_a" not in terms[id]:
            yield path
            return
        for parent in terms[id]["is_a"]:
            yield from get_ancestors(parent, [parent, *path])
                
    return get_ancestors(go_id, [go_id])

def get_distinct_at_depth(paths, depth):
    return { path[depth] if depth < len(path) else None for path in paths }

if __name__ == "__main__":
    file_path = 'go-basic.obo'
    terms = parse_obo(file_path)
    # Hierarchy is not needed?
    # roots = make_hierarchy(terms)
    
    # Read and process the unique.txt file
    unique_file_path = 'unique.txt'
    unique_go_frequency = calculate_go_frequency(unique_file_path)

    # Create a list of dictionaries for the data
    data = []
    for go_id, frequency in unique_go_frequency.items():
        if not go_id.startswith("GO:"):
            continue
        term = terms.get(go_id, {})
        term_name = term.get("name", "unknown_term")
        
        # Get the Level 2 GO ID based on the hierarchy
        level_2_nodes = get_distinct_at_depth(find_paths_go_id(go_id, terms), 1)
        
        for level_2_id in level_2_nodes:
            data.append({
                "GO ID": go_id,
                "Term": term_name,
                "Level 2 GO ID": level_2_id,
                "Frequency": frequency
            })
            
    # Create a DataFrame from the data
    df = pd.DataFrame(data)
    
    pd.set_option('display.max_columns', 7)
    print(df)

Running this program df prints out as follows:

         GO ID                                  Term Level 2 GO ID  Frequency
0   GO:0006996              [organelle organization]    GO:0006996          1
1   GO:0048313                   [Golgi inheritance]    GO:0006996          4
2   GO:0048308               [organelle inheritance]    GO:0006996          3
3   GO:0000001           [mitochondrion inheritance]    GO:0006996          6
4   GO:0000001           [mitochondrion inheritance]    GO:0000001          6
5   GO:0048309   [endoplasmic reticulum inheritance]    GO:0006996          2
6   GO:0007029  [endoplasmic reticulum organization]    GO:0006996          1
7   GO:0000166                          unknown_term          None          1
8   GO:0007030                  [Golgi organization]    GO:0006996          1
9   GO:0090166                   [Golgi disassembly]    GO:1903008          6
10  GO:0090166                   [Golgi disassembly]    GO:0006996          6
11  GO:1903008               [organelle disassembly]    GO:1903008          1
12  GO:1903008               [organelle disassembly]    GO:0006996          1
13  GO:0048311                          unknown_term          None          2

Note that for this you don't need the make_hierarchy function, but maybe you need it for other purposes.

trincot
  • 317,000
  • 35
  • 244
  • 286
  • Updated the answer in light of requirements to potentially generate multiple level 2 nodes when node occurs multiple times. – trincot Aug 17 '23 at 13:42
  • I added the output I get. – trincot Aug 17 '23 at 13:56
  • can I get your mail id – Umar Aug 17 '23 at 14:01
  • See my profile. – trincot Aug 17 '23 at 14:03
  • I appreciate your response. However, I'm still encountering the same issue where the query ID is only appearing once. – Umar Aug 17 '23 at 17:55
  • I cannot reproduce this. I have added the output I get in my answer. If you mean you also get this, but have *different* input for which you see a problem, then please provide a link to that input, so I can run the program with that data. I suppose with "query ID", you mean "GO id"? – trincot Aug 17 '23 at 17:58
  • Just a side note: if the GO id appears twice in the input, but they share the same second-level id, then this code will only output that id once. Just make sure you are not referring to such a case? – trincot Aug 17 '23 at 18:03
  • I am looking for distinct GO IDs at the second-level hierarchy. If a GO ID appears twice in the input and they share the same second-level ID, the code should output that ID only once. However, if the hierarchy is different, I want to display all the distinct GO IDs. – Umar Aug 17 '23 at 18:41
  • That's what I understood. So for which input you say it doesn't work? – trincot Aug 17 '23 at 18:43
  • I provide in mail, if possible, could u please check the output , i got ,a nd the code i gave, that code i edited as per you suggested, – Umar Aug 17 '23 at 18:56
  • Sorry, can you please document it here? Remember that other people are following this question, and future readers will need to know. Please provide the info here. Make sure to provide a *minimal* case of input where you can reproduce a problem. You can add it to your question. – trincot Aug 17 '23 at 18:59
  • 1
    Possibly you have not applied the same changes as I have indicated in this answer. So I added the revised code in its entirety. – trincot Aug 17 '23 at 19:16
  • I apologize for any confusion. Thank you for providing the revised code. The code you posted worked as expected and produced the desired output, that's great! – Umar Aug 17 '23 at 19:22
  • I am facing one more problm with this code, in term column, name is not appearing full, its getting cut , such as it coming [organelle], but is should be ['organelle organization'] – Umar Aug 19 '23 at 05:23
  • I see. Code updated. – trincot Aug 19 '23 at 06:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254964/discussion-between-umar-and-trincot). – Umar Aug 19 '23 at 06:25