I have a huge file file1.txt that looks like this :
POS 1 1 7 7 11 11 12
1625 1 2 0 1 1 0 2
2864 2 2 1 2 1 2 1
The first row has column names (integers; note that some columns have the same names, although different values in rows), and columns are separated by a single empty space.
I have another file file2.txt with a subset of column names from file1.txt, one per line:
1
11
12
Using python, I want to get a file3.txt with all the columns from file1.txt that have matching column names of file2.txt:
POS 1 1 11 11 12
1625 1 2 1 0 2
2864 2 2 1 2 1
I have tried the following:
import numpy as np
import pandas as pd
with open("file2.txt") as file_in:
lines = []
for line in file_in:
lines.append(line)
df = pd.read_csv('file1.txt', sep=' ')
for x in df.columns:
# add line of code
lines.append(x)
print (lines)
I found a similar bash solution here: Extract columns from a file based on header selected from another file
However, this solution doesn't seem to apply for columns with duplicate names.