
There are several files like this:

sample_a.txt containing:

a
b
c

sample_b.txt containing:

b
w
e

sample_c.txt containing:

a
m
n

I want to make a matrix of absence/presence like this:

            a    b    c    w    e    m    n
sample_a    1    1    1    0    0    0    0
sample_b    0    1    0    1    1    0    0
sample_c    1    0    0    0    0    1    1

I know a crude way to solve it: build a list of every letter that occurs in the files, then compare each line of each file against this 'library' and fill in the final matrix by index. But I suspect there is a smarter solution. Any ideas?

Update: the sample files can be of different lengths.

plnnvkv
  • While reading the files, create a dictionary keyed by `sample_a`, `sample_b`, etc., whose values are sets like `{a, m, n}`. Using a `defaultdict(set)` will help. Alternatively, use an `OrderedDict` if order matters. Build the union of all such sets while reading. After that, converting the dictionary to the matrix you want is relatively simple. – John Coleman Jul 14 '20 at 11:37
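
The approach in the comment above could be sketched roughly as follows. This is a minimal illustration, not the commenter's own code: the helper names `read_items` and `presence_matrix` are made up here, and it assumes one item per line with blank lines ignored. A plain dict of sets stands in for `defaultdict(set)` because each file is read in one go.

```python
from pathlib import Path


def read_items(path):
    """Return the set of non-empty, stripped lines in a text file."""
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}


def presence_matrix(paths):
    """Return (columns, rows): sorted item names and one 0/1 row per sample."""
    # Key each set by the file name without its extension, e.g. "sample_a".
    sets = {Path(p).stem: read_items(p) for p in paths}
    # The union of all sets gives the matrix columns (sorted alphabetically here).
    columns = sorted(set().union(*sets.values()))
    # 1 if the sample contains the item, 0 otherwise.
    rows = {name: [int(c in items) for c in columns] for name, items in sets.items()}
    return columns, rows
```

Calling `presence_matrix(["sample_a.txt", "sample_b.txt", "sample_c.txt"])` on the question's files should return the columns in sorted order and one list of 0/1 flags per sample. Files of different lengths need no special handling, since each file is reduced to a set first.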

1 Answer


You can try:

import pandas as pd
from collections import defaultdict
from pathlib import Path

dd = defaultdict(list)  # maps sample name -> list of items read from that file

files = ["sample_a.txt", "sample_b.txt", "sample_c.txt"]
for file in files:
    name = Path(file).stem  # file name without extension, e.g. "sample_a"
    with open(file, "r") as f:
        for row in f:
            item = row.strip()  # drop the trailing '\n' and surrounding whitespace
            if item:  # skip blank lines
                dd[name].append(item)

# Lists of different lengths are padded with missing values;
# .T.melt() converts the table to long format (columns: variable, value)
df = pd.DataFrame.from_dict(dd, orient='index').T.melt()
# crosstab counts occurrences per (sample, item) pair, similar to pd.pivot_table
table = pd.crosstab(df.variable, df.value)
print(table)

Result (for input in which sample_b.txt also contains f, and sample_c.txt also contains o and p):

value     a  b  c  e  f  m  n  o  p  w
variable                              
sample_a  1  1  1  0  0  0  0  0  0  0
sample_b  0  1  0  1  1  0  0  0  0  1
sample_c  1  0  0  0  0  1  1  1  1  0

Note that the columns (letters) come out in alphabetical order, not in the order of first appearance shown in the question.
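
If the order of first appearance from the question is preferred, one possible sketch is to record that order separately and then reindex the crosstab columns with it. The helper name `first_seen_order` and the in-memory `samples` dict are assumptions for illustration; in practice the lists would come from the files.

```python
def first_seen_order(samples):
    """Return items in order of first appearance across all samples."""
    order = {}  # a dict keeps insertion order, so it works as an ordered set
    for items in samples.values():
        for item in items:
            order.setdefault(item, None)
    return list(order)


samples = {
    "sample_a": ["a", "b", "c"],
    "sample_b": ["b", "w", "e"],
    "sample_c": ["a", "m", "n"],
}
order = first_seen_order(samples)
print(order)  # ['a', 'b', 'c', 'w', 'e', 'm', 'n']

# With the crosstab from the answer, this order could then be applied as:
# pd.crosstab(df.variable, df.value).reindex(columns=order)
```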

ipj
  • This only works if the arrays are of the same length. How can this be fixed? – plnnvkv Jul 15 '20 at 10:47
  • Could you update the question to reflect this requirement, please? – ipj Jul 15 '20 at 10:56
  • I've proposed an edit to the question with data of different lengths; please accept the edit. I then updated the answer, and it now works. If it runs as expected, please consider accepting the answer as the final solution. – ipj Jul 15 '20 at 11:35