
I have two files (for example, A.txt and B.txt), where "A.txt" is very large. I would like to avoid reading the full file into memory, and instead read it line by line, merging in the matching rows from "B.txt" as I go. Both files have headers as well.

My current code looks like this:

import pandas as pd

# read both files fully into memory and merge on the shared 'contig' column
contigs = pd.read_csv("A.txt", header=0, sep="\t")
coverages = pd.read_csv("B.txt", header=0, sep="\t")
merged = pd.merge(contigs, coverages, on='contig')
merged.to_csv("merged_file.txt", sep="\t", index=False)

The code works, but as mentioned above I would like to read "A.txt" line by line instead of loading it fully into memory, merge it with "B.txt", and then write the result out.

Thanks a lot for your help!

(Updating original post with example files)

head A.txt
clusterID       kegg_contig     contig
Cluster_10700   Unassigned_ERR1801630_792963    ERR1801630_contig_792963
Cluster_10700   Unassigned_ERR1801633_537686    ERR1801633_contig_537686
Cluster_10700   Unassigned_ERR505054_53474      ERR505054_contig_53474
Cluster_10700   Unassigned_ERR505054_31574      ERR505054_contig_31574


head B.txt
contig  coverage
ERR1726751_contig_1     28.82716
ERR1726751_contig_2     12.265934
ERR1726751_contig_3     17.733767
Susheel Busi
  • See the chunksize merge approach at https://stackoverflow.com/questions/58441517/merging-dataframe-chunks-in-pandas: `file = "tableFile/123456.txt"; initDF = pd.read_csv(file, sep="\t", header=0); file2 = "tableFile/7891011.txt"; for chunks in pd.read_csv(file2, sep="\t", chunksize=50000, header=0): initDF = initDF.merge(chunks, how='right', on=['Mod', "Nuc", "AA"])` – Golden Lion Feb 08 '21 at 03:13
  • Thanks @GoldenLion. I tried this, but for whatever reason it prints an `empty` dataframe at the end. – Susheel Busi Feb 08 '21 at 08:39
  • Will you post a sample of a and b data frame and I will test the code with the data – Golden Lion Feb 08 '21 at 11:34
  • Thank you @GoldenLion. Top few lines of both files are posted in the original – Susheel Busi Feb 08 '21 at 11:51
  • in the data sample there are no matches on the contig field – Golden Lion Feb 08 '21 at 12:20
  • As I said, the 2nd file is huge, with a lot of lines. I only posted the top 'n' lines, and I can guarantee that there are matching lines further down. That's the whole point of this question, i.e. to merge the lines that match and leave out the rest. – Susheel Busi Feb 08 '21 at 12:36
  • there are many types of joins: inner, left, right, symmetrical, etc. What join pattern are you attempting? – Golden Lion Feb 08 '21 at 18:34
  • I need the 'coverage' column from B.txt to be added to A.txt, based on matching "contig" IDs. – Susheel Busi Feb 09 '21 at 08:05

2 Answers

import pandas as pd

initDF = pd.read_csv("merge_a.csv", sep=",", header=0)
file2 = "merge_b.csv"
# merge each chunk against the small table separately and collect the
# pieces: re-assigning initDF to the inner merge inside the loop would
# drop rows whose match only appears in a later chunk, which is why the
# loop kept producing an empty dataframe
pieces = []
for chunk in pd.read_csv(file2, sep=",", chunksize=50, header=0):
    pieces.append(initDF.merge(chunk, how='inner', on=['contig']))

print(pd.concat(pieces, ignore_index=True))
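
If even the collected result is too large to hold in memory, a variant of the same idea (a sketch, assuming the same example file names and a shared contig column) writes each merged chunk straight to disk instead of accumulating it:

import pandas as pd

small = pd.read_csv("merge_a.csv", sep=",", header=0)

first = True
for chunk in pd.read_csv("merge_b.csv", sep=",", chunksize=50, header=0):
    merged = small.merge(chunk, how='inner', on=['contig'])
    # write the header only for the first chunk, then append
    merged.to_csv("merged_file.txt", sep="\t", index=False,
                  mode='w' if first else 'a', header=first)
    first = False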
Golden Lion
  • Thank you but I have an `empty` dataframe again. ``` >>> contigs=pd.read_csv("contigs/Cluster_10700_contigs.txt", sep="\t", header=0) >>> file2="coverage/merged_euci_coverages.txt" >>> for chunks in pd.read_csv(file2, sep="\t", chunksize=50000, header=0): ... contigs=contigs.merge(chunks, how='inner', on=['contig']) ... >>> contigs.head() Empty DataFrame ``` – Susheel Busi Feb 08 '21 at 12:37
  • I didn't see any matches on the contigs field for the data you provided – Golden Lion Feb 08 '21 at 15:41

For such a simple problem, it is easy to process one file line by line, provided the merge field is unique in the other file. For the rest of this answer, I will assume that contig is unique in B.txt (see the sketch at the end of this answer for the non-unique case):

import csv

# load B into a dictionary
with open('B.txt') as file_B:
    rd = csv.reader(file_B, delimiter='\t')
    _ = next(rd)             # skip header line
    dict_B = {row[0]: row[1] for row in rd if len(row) >= 2}  # map contig -> coverage

# process file A line by line
with open('A.txt') as fdin, open('merged_file.txt', 'w', newline='') as fdout:
    rd = csv.reader(fdin, delimiter='\t')
    wr = csv.writer(fdout, delimiter='\t')
    # process header line
    row = next(rd)
    row.append('coverage')  # append last column header
    wr.writerow(row)
    # process data lines, skipping malformed rows and contigs absent from B.txt
    for row in rd:
        if len(row) < 3:                # guards against an IndexError on blank/short lines
            continue
        coverage = dict_B.get(row[2])
        if coverage is not None:        # keep only rows with a match (inner join)
            wr.writerow(row + [coverage])

This only uses the csv module and does not use pandas. It will save a lot of memory, but could require more time...
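
If contig turned out not to be unique in B.txt, a small extension of the same idea (a sketch, mapping each contig to a list of coverages) reproduces inner-join semantics by writing one output line per match:

import csv
from collections import defaultdict

# map each contig to every coverage value it carries in B.txt
dict_B = defaultdict(list)
with open('B.txt') as file_B:
    rd = csv.reader(file_B, delimiter='\t')
    next(rd)                              # skip header line
    for row in rd:
        if len(row) >= 2:
            dict_B[row[0]].append(row[1])

with open('A.txt') as fdin, open('merged_file.txt', 'w', newline='') as fdout:
    rd = csv.reader(fdin, delimiter='\t')
    wr = csv.writer(fdout, delimiter='\t')
    wr.writerow(next(rd) + ['coverage'])  # extend the header
    for row in rd:
        if len(row) < 3:
            continue
        # one output line per matching coverage, like a SQL inner join
        for coverage in dict_B.get(row[2], []):
            wr.writerow(row + [coverage])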

Serge Ballesta
  • Thank you, but getting the following error: ``` Traceback (most recent call last): File "", line 10, in IndexError: list index out of range ``` – Susheel Busi Feb 08 '21 at 13:51