I'm trying to convert a heavy .txt file that holds the string representation of a dictionary, like this (an excerpt):

"{'A45171': {'Gen_n': 'Putative uncharacterized protein', 'Srce': 'UniProtKB', 'Ref': 'GO_REF:0000033', 'Tax': 'NCBITaxon:684364', 'Gen_bl': 'BATDEDRAFT_15336', 'Gen_id': 'UniProtKB:F4NTD6', 'Ev_n': 'IBA', 'GO_n': 'ergosterol biosynthetic process', 'GO': 'GO:0006696', 'Org': 'Batrachochytrium dendrobatidis JAM81', 'Type': 'protein', 'Ev_e': 'ECO:0000318', 'Con': 'GO_Central'}, 'A43886': {'Gen_n': 'Uncharacterized protein', 'Srce': 'UniProtKB', 'Ref': 'GO_REF:0000002', 'Tax': 'NCBITaxon:9823', 'Gen_bl': 'RDH8', 'Gen_id': 'UniProtKB:F1S3H8', 'Ev_n': 'IEA', 'GO_n': 'estrogen biosynthetic process', 'GO': 'GO:0006703', 'Org': 'Sus scrofa', 'Type': 'protein', 'Ev_e': 'ECO:0000501', 'Con': 'InterPro'}}"

I've tried the ast module:

import ast

# Read the whole file and parse it as a Python literal
with open("Gene_Ontology/output_data/dic_gene_definitions.txt", "r") as handle:
    dic_gene_definitions = ast.literal_eval(handle.read())

The file weighs 22 MB, and when the call doesn't crash, it runs very slowly.

I really want to open 500 MB files... I've looked at the json module, which can parse much faster, but it also crashes on heavy dictionary strings (not on short examples).
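
A caveat with the json route: json.loads() only accepts double-quoted strings, so a single-quoted dict string like the one above would first need its quotes converted. A quick illustration:

import json

json.loads('{"A45171": {"GO": "GO:0006696"}}')    # valid JSON: parses fine
# json.loads("{'A45171': {'GO': 'GO:0006696'}}")  # single quotes: raises json.JSONDecodeError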

Any solution...?

Thank you so much.

  • What are the use cases that need a 500 MB file loaded into memory? – Frank C. May 10 '17 at 07:45
  • For a dictionary of protein annotations, which includes amino acid sequences (approx. 300 bits per sequence + description, 66,000 proteins). – Josep Llobet May 10 '17 at 10:53
  • Again, how will you use the dictionary... just to create the AST? I'm not a Python guru by any means, but would it make sense to build a hash map of keys that point to file offsets for the values, load that into memory first, and then fault in the actual data value(s) when a key is referenced? (A sketch of this idea follows the comments below.) – Frank C. May 11 '17 at 19:52
  • I don't fully understand. I just take a dictionary file from the string of a dictionary, and I want to store the data so it can be looked up by key like a normal dictionary. In any case, I've accepted the answer below. – Josep Llobet May 17 '17 at 11:18
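
A minimal sketch of the offset-index idea from the comment above, assuming a hypothetical tab-separated layout with the record ID in the first column; only the key-to-offset map lives in RAM, and values are read from disk on demand:

index = {}
with open("GO_annotations.tsv", "rb") as handle:   # hypothetical file name
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        # Map the first tab-separated column (the record ID) to its byte offset
        index[line.split(b"\t", 1)[0].decode()] = offset

def lookup(key):
    # Fault in a single record by seeking to its stored offset
    with open("GO_annotations.tsv", "rb") as handle:
        handle.seek(index[key])
        return handle.readline().decode().rstrip("\n").split("\t")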

1 Answer

I've looked for a method that doesn't consume so much RAM.

By running, in Ubuntu's terminal:

sudo swapon -s

we can see how much swap space different operations consume:

Filename                Type        Size    Used    Priority
/swapfile               file        2097148 19876   -1

To build a dictionary from a file of this size (500 MB), the best approach is to store the data as plain tab-separated text and read it line by line, which keeps RAM consumption low:

with open("Gene_Ontology/output_data/GO_annotations_dictionary.txt", "r") as handle:
    for record in handle.read().splitlines():
        anote = record.split("\t")
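
From there, the dictionary can be rebuilt incrementally; a minimal sketch, assuming (hypothetically) that the first column holds the record ID and the remaining columns hold its annotation fields:

annotations = {}
with open("Gene_Ontology/output_data/GO_annotations_dictionary.txt", "r") as handle:
    for record in handle:
        anote = record.rstrip("\n").split("\t")
        # First column as the key, the remaining columns as its values (assumed layout)
        annotations[anote[0]] = anote[1:]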

The ast module is fine, but not for large files.