-1

I have a huge file file1.txt that looks like this :

POS 1 1 7 7 11 11 12  
1625 1 2 0 1 1 0 2
2864 2 2 1 2 1 2 1

The first row has column names (integers; note that some columns have the same names, although different values in rows), and columns are separated by a single empty space.

I have another file file2.txt with a subset of column names from file1.txt, one per line:

1
11
12

Using python, I want to get a file3.txt with all the columns from file1.txt that have matching column names of file2.txt:

POS 1 1 11 11 12  
1625 1 2 1 0 2
2864 2 2 1 2 1

I have tried the following:

import numpy as np
import pandas as pd

with open("file2.txt") as file_in:
lines = []
for line in file_in:
    lines.append(line)
    
df = pd.read_csv('file1.txt', sep=' ')

for x in df.columns: 
# add line of code
        lines.append(x) 
print (lines)

I found a similar bash solution here: Extract columns from a file based on header selected from another file

However, this solution doesn't seem to apply for columns with duplicate names.

Lucas
  • 1,139
  • 3
  • 11
  • 23

1 Answers1

1

I hope following code helps

import pandas as pd

with open("b.txt") as file_in:
    lines = []
    for line in file_in:
        lines.append(line.strip())
print(lines)
# Lines is like ['1', '11', '12']
df = pd.read_csv('a.txt', sep=' ', dtype=str, header=None)
print(df)
"""
df is like 
      0  1  2  3  4   5   6   7
0   POS  1  1  7  7  11  11  12
1  1625  1  2  0  1   1   0   2
2  2864  2  2  1  2   1   2   1


"""
requirement = []
idx = 0
for x in list(df.iloc[0, :]):
    if x in lines:
        k = list(df.iloc[:, idx])
        requirement.append(k)
    idx += 1
print(requirement)
"""requirement is like 
[['1', '1', '2'], ['1', '2', '2'], ['11', '1', '1'], ['11', '0', '2'], ['12', '2', '1']]
"""
print(pd.DataFrame(requirement).T)
"""pd.DataFrame(requirement).T is like
   0  1   2   3   4
0  1  1  11  11  12
1  1  2   1   0   2
2  2  2   1   2   1
"""
pd.DataFrame(requirement).T.to_csv("c.txt", header=False, index=False, sep=" ")
Danish Bansal
  • 608
  • 1
  • 7
  • 25
  • Hi @danish bansal. I still get the same output data frame with non-duplicated columns. Thank you – Lucas Sep 22 '21 at 18:40