
I need to parse a very large text file line by line, apply some text manipulation (`.replace()` etc.), and then populate the results into an array by index.

I need the result returned to the global scope, as this will be a function in a module I will use in a project.

I am using readline because I would like to avoid outside libraries and prefer to stay as close to base JS as possible.

I am also using TypeScript in my minimal example here:

import fs from 'fs';
import readline from 'readline';

const streamParseGEOgds = async (inputFile: string, verbose: boolean) => {
    let datasets: string[] = [];

    let line_counter = 0;
    let entry_counter = -1;
    
    return new Promise<string[]>((resolve, reject) => {
        const stream = fs.createReadStream(inputFile);
        stream.on('error', reject); // reject the promise if the file cannot be read

        const rl = readline.createInterface({
            input: stream,
            crlfDelay: Infinity
        });

        rl.on('line', (line: string) => {
            line_counter++;

            if (line === "") {
                entry_counter++;

                // datasets[entry_counter] = "";
            }

            // ------------------------------
            const summary_regex: RegExp = /^[0-9]+\. /; // escape the dot so only "1. ", "2. " etc. match
            if (summary_regex.test(line)) {
                datasets[entry_counter] = line.replace(summary_regex, "");
            }

            // ------------------------------
            if (verbose) {
                console.log(`streamParseGEOgds line ${line_counter}: ${line}`);
            }
        }).on('close', () => {
            // I can log the object from inside here but I want to get out of this scope to the global scope
            resolve(datasets);
        });
    });
}

const res = streamParseGEOgds("./gds_result.txt", false).then(async (datasets) => {
    console.log(datasets); // logs what I want; the parsed data
    return datasets;
})

console.log(`Logging the res object: ${res}`); // returns a pending promise and does not return the actual data I want
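To illustrate what I am seeing, here is a stripped-down sketch of the behaviour (`fakeParse` below is just an illustrative stand-in for my function, not the real parser): the value of an async function only ever comes back as a Promise, and the array itself is only reachable once it is awaited, which puts me back inside an async scope.

```typescript
// fakeParse stands in for streamParseGEOgds; the values are illustrative.
const fakeParse = async (): Promise<string[]> => ["entry one", "entry two"];

// Calling it directly hands back the Promise wrapper, not the array.
const pending = fakeParse();
console.log(Object.prototype.toString.call(pending)); // "[object Promise]"

// Awaiting yields the actual array - but only inside an async scope.
(async () => {
    const datasets = await fakeParse();
    console.log(datasets.length); // 2
})();
```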

My input file looks something like this:


1. Glycosylated clusterin species facilitate amyloid beta toxicity in human neurons.
(Submitter supplied) Clusterin (CLU) is one of the most significant genetic risk factors for late onset Alzheimer’s disease. Numerous studies have now demonstrated that CLU-AD mutations and amyloid-β (Aβ) treatment alter the trafficking and localisation of glycosylated CLU. iPSCs with altered CLU trafficking were generated following the removal of CLU exon 2 by CRISPR/Cas9 gene editing. Neurons were generated from control, unedited and exon 2 -/- iPSCs and were incubated with aggregated Aβ peptides. more...
Organism:   Homo sapiens
Type:       Expression profiling by high throughput sequencing
Platform: GPL24676 18 Samples
FTP download: GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE207nnn/GSE207466/
Series      Accession: GSE207466    ID: 200207466

2. CROPSeq of Putative AD- and PSP-associated cis-Regulatory Regions in iPSC-derived Neurons and Microglia
(Submitter supplied) We performed a pooled CRISPRi screen (CROP-seq) and genome editing to validate 19 genetic variants prioritized from massively parallel reporter assays to screen 5,706 polymorphisms from genome-wide association studies for both Alzheimer’s disease (AD) and Progressive Supranuclear Palsy (PSP) across 11 distinct loci. This allowed us to pinpoint regulatory targets in a cell-type specific manner.
Organism:   Homo sapiens
Type:       Expression profiling by high throughput sequencing; Other
Platform: GPL24676 4 Samples
FTP download: GEO ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE207nnn/GSE207099/
Series      Accession: GSE207099    ID: 200207099

3. Spermidine reduces neuroinflammation and soluble amyloid beta in an Alzheimer’s disease mouse model
(Submitter supplied) Deposition of amyloid beta (Aβ) and hyperphosphorylated tau along with glial cell-mediated neuroinflammation are prominent pathogenic hallmarks of Alzheimer’s disease (AD). In recent years, impairment of autophagy has been found to be another important feature contributing to AD progression. Therefore, the potential of the autophagy activator spermidine, a small body-endogenous polyamine often used as dietary supplement, was assessed on Aβ pathology and glial cell-mediated neuroinflammation. more...
Organism:   Mus musculus
Type:       Expression profiling by high throughput sequencing
Platform: GPL24247 8 Samples
FTP download: GEO (H5, RDS) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE206nnn/GSE206202/
Series      Accession: GSE206202    ID: 200206202

The first and last lines are always empty; this is important, as I rely on the empty lines to delimit entries when parsing.

Can someone help me please? I am getting a promise back and cannot find a way to get my data outside of the callback scope. I understand that my information is all there, but I really need it returned to the global scope so that my function can return the array of parsed data at the end.

Finally here is my tsconfig.json:

{
    "compilerOptions": {
        "target": "ES2020",
        "module": "CommonJS",
        "esModuleInterop": true,
        "forceConsistentCasingInFileNames": true,
        "strict": true,
        "skipLibCheck": true
    }
}
  • "*I am getting a promise back and cannot find a way to get my data outside of the callback scope*" - you never will. Move your code that needs the data inside the callback. There is no way around that. Learn to embrace promises instead of fighting them! – Bergi Jul 08 '22 at 19:35
  • I did read this somewhere... that is very unfortunate. Does this mean that in effect I will have to use something other than the `readline` module, which works asynchronously? How could I refactor so I return to the global scope? I appreciate your help. – dereckmezquita Jul 08 '22 at 19:37
  • Why do you think you need a variable in the global scope in the first place? – Bergi Jul 08 '22 at 19:39
  • I want to have this be a simple function I can call as a module. When I use the function I want the returned result to be an array with the actual data and not a promise. I thought that if I had the datasets array in the global scope then I could somehow read the file, do my things, and then return at the end of my function - which, I've discovered, I cannot do due to the asynchronous nature. – dereckmezquita Jul 08 '22 at 19:42
  • But it *should* be a function that returns a promise. There's nothing wrong with it. In the place where you import and call that function, just use `.then()` or `await` - chances are high in a large application that the calling function already is asynchronous anyway. – Bergi Jul 08 '22 at 19:45
  • Yes that’s true, I could continue on with the promise and work from within the callback thereafter. In essence my current question now boils down to: if I absolutely wanted an array returned and not a promise, would I have to scrap `readline`? If so, how else could I solve this same problem? – dereckmezquita Jul 08 '22 at 20:01
  • You could just call [`fs.readFileSync`](https://nodejs.org/api/fs.html#fsreadfilesyncpath-options), but yes, you really shouldn't do that. – Bergi Jul 08 '22 at 20:06
  • Thank you very much for your help. My final two questions, and then I’m off: unfortunately, I note that `readFileSync` loads the whole of the file into memory. Is there a way to parse a text file line by line synchronously, yet not load all of it into memory at once? These are very large text files I am dealing with. Finally, what are the advantages of sticking to the async/promise syntax? Is this more efficient computationally speaking? This is assuming a method for my first question here is possible. – dereckmezquita Jul 08 '22 at 20:15
  • I guess you could [open a file synchronously](https://nodejs.org/api/fs.html#fsopensyncpath-flags-mode) and then [read it chunk by chunk](https://nodejs.org/api/fs.html#fsreadsyncfd-buffer-offset-length-position), but that's not line by line - I don't know if there's any library doing that for you. But especially if the file is huge, you really want to read it asynchronously, to avoid blocking other parts of your application. – Bergi Jul 08 '22 at 20:23
