I want to merge multiple files with a single file (f1.txt), matching on two columns against that file. I can do this in pandas, but it reads everything into memory, which can get big really fast. I am thinking a line-by-line read will avoid loading everything into memory; pandas is not an option here anyway. How do I perform the merge while filling in null for the cells where a file has no match with f1.txt?
Below I use a dictionary, but I am not sure it will fit in memory, and I also can't find a way to add null where a file has no match for a row in f1.txt. There could be as many as 1000 other files. Runtime does not matter as long as I do not read everything into memory.
FILES (tab-delimited)
f1.txt
A B num val scol
1 a1 1000 2 3
2 a2 456 7 2
3 a3 23 2 7
4 a4 800 7 3
5 a5 10 8 7
a1.txt
A B num val scol fcol dcol
1 a1 1000 2 3 0.2 0.77
2 a2 456 7 2 0.3 0.4
3 a3 23 2 7 0.5 0.6
4 a4 800 7 3 0.003 0.088
a2.txt
A B num val scol fcol2 dcol1
2 a2 456 7 2 0.7 0.8
4 a4 800 7 3 0.9 0.01
5 a5 10 8 7 0.03 0.07
Current Code
import os
import csv

m1 = os.getcwd() + '/f1.txt'
files_to_compare = [i for i in os.listdir('dir')]
dictionary = dict()
dictionary1 = dict()

with open(m1, 'rt') as a:
    reader1 = csv.reader(a, delimiter='\t')
    for x in files_to_compare:
        with open(os.getcwd() + '/dir/' + x, 'rt') as b:
            reader2 = csv.reader(b, delimiter='\t')
            # store every f1.txt row keyed on column A (reader1 is
            # exhausted after the first file, so this runs only once)
            for row1 in list(reader1):
                dictionary[row1[0]] = list()
                dictionary1[row1[0]] = list(row1)
            # append the extra columns of rows whose key matches f1.txt
            for row2 in list(reader2):
                try:
                    dictionary[row2[0]].append(row2[5:])
                except KeyError:
                    pass
print(dictionary)
print(dictionary1)
What I am trying to achieve is similar to using: df.merge(df1, on=['A','B'], how='left').fillna('null') (see the sketch after the desired result below).
Current result
{'A': [['fcol2', 'dcol1'], ['fcol', 'dcol']], '1': [['0.2', '0.77']], '2': [['0.7', '0.8'], ['0.3', '0.4']], '3': [['0.5', '0.6']], '4': [['0.9', '0.01'], ['0.003', '0.088']], '5': [['0.03', '0.07']]}
{'A': ['A', 'B', 'num', 'val', 'scol'], '1': ['1', 'a1', '1000', '2', '3'], '2': ['2', 'a2', '456', '7', '2'], '3': ['3', 'a3', '23', '2', '7'], '4': ['4', 'a4', '800', '7', '3'], '5': ['5', 'a5', '10', '8', '7']}
Desired result
{'A': [['fcol2', 'dcol1'], ['fcol', 'dcol']], '1': [['0.2', '0.77'], ['null', 'null']], '2': [['0.7', '0.8'], ['0.3', '0.4']], '3': [['0.5', '0.6'], ['null', 'null']], '4': [['0.9', '0.01'], ['0.003', '0.088']], '5': [['null', 'null'], ['0.03', '0.07']]}
{'A': ['A', 'B', 'num', 'val', 'scol'], '1': ['1', 'a1', '1000', '2', '3'], '2': ['2', 'a2', '456', '7', '2'], '3': ['3', 'a3', '23', '2', '7'], '4': ['4', 'a4', '800', '7', '3'], '5': ['5', 'a5', '10', '8', '7']}
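A rough sketch of the per-file logic I think could produce that result, streaming one row at a time (the function name merge_one_file, the (A, B) tuple key, and the assumption that the first five columns are the shared ones are all mine, untested on the real data):

import csv

def merge_one_file(keys, path, merged):
    """Stream one extra file. keys is the set of (A, B) tuples from
    f1.txt; merged maps each key to a list of extra-column lists."""
    seen = set()
    with open(path, 'rt') as fh:
        reader = csv.reader(fh, delimiter='\t')
        header = next(reader)
        width = len(header) - 5        # columns beyond the five shared ones
        for row in reader:
            key = (row[0], row[1])     # match on columns A and B
            if key in keys:
                merged.setdefault(key, []).append(row[5:])
                seen.add(key)
    # pad the keys this file did not contain, like fillna('null')
    for key in keys - seen:
        merged.setdefault(key, []).append(['null'] * width)

Only the (A, B) keys from f1.txt would have to stay in memory; each of the other files is read row by row.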
My final intent is to write the dictionary to a text file. I do not know how much memory it will use or whether it will even fit. If there is a better way to do this without pandas, that would be nice; otherwise, how do I make the dictionary approach work?
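For the writing step, this is roughly what I picture, flushing the two dictionaries from my code above into one tab-delimited file ('merged.txt' is just a placeholder name):

import csv

# Flatten each key's accumulated column lists into one output row,
# reusing the dictionary/dictionary1 built in my code above
with open('merged.txt', 'w', newline='') as out:
    writer = csv.writer(out, delimiter='\t')
    for key, base_row in dictionary1.items():
        extras = [cell for cols in dictionary[key] for cell in cols]
        writer.writerow(base_row + extras)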
DASK ATTEMPT:
import dask.dataframe as dd
directory = 'input_dir/'
first_file = dd.read_csv('f1.txt', sep='\t')
df = dd.read_csv(directory + '*.txt', sep='\t')
df2 = dd.merge(first_file, df, on=['A', 'B'])
I kept getting:

ValueError: Metadata mismatch found in `from_delayed`

+--------+-------+----------+
| Column | Found | Expected |
+--------+-------+----------+
| fcol   | int64 | float64  |
+--------+-------+----------+
I googled and found similar complaints, but I could not fix it; that is why I decided to try the dictionary approach above. I checked my files and all the dtypes seem consistent. My dask version is 2.9.1.
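The suggestions I came across (which I have not managed to verify on my data) were to pin the dtype of the offending column at read time, or to let dask assume that integer columns may contain missing values:

import dask.dataframe as dd

# Pin the column named in the mismatch to a single dtype up front
df = dd.read_csv('input_dir/*.txt', sep='\t', dtype={'fcol': 'float64'})

# Or treat all inferred integer columns as floats, which avoids
# int64-vs-float64 mismatches between sampled partitions
df = dd.read_csv('input_dir/*.txt', sep='\t', assume_missing=True)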