Iterate column for matches in another column

Question

I have files that look like:

chr1:92092600   G[chr2:164084669[   ENSG00000189195 ENST00000342818 BTBD8   chr2:164084669
chr1:121498879  T[chr9:2781522[ ENSG00000233432 ENST00000425455 AL592494.2  chr9:2781522
chr2:101298260  ]chr3:196435392]A   ENSG00000163162 ENST00000295317 RNF149  chr3:196435392
chr2:164084669  ]chr1:92092600]G    ENSG00000237844 ENST00000429636 AC016766.1  chr1:92092600
chr9:2781522    ]chr1:121498879]T   ENSG00000080608 ENST00000490444 PUM3    chr1:121498879
chr3:196435392  A[chr2:101298260[   ENSG00000163960 ENST00000296328 UBXN7   chr2:101298260

For every element in column 6, I would like to search column 1 and, if it is present, print the entire matching line. So the expected output for the first 3 elements in column 6 should look like:

chr2:164084669  ]chr1:92092600]G    ENSG00000237844 ENST00000429636 AC016766.1  chr1:92092600
chr9:2781522    ]chr1:121498879]T   ENSG00000080608 ENST00000490444 PUM3    chr1:121498879
chr3:196435392  A[chr2:101298260[   ENSG00000163960 ENST00000296328 UBXN7   chr2:101298260

So far I have:

import pandas as pd

pd.options.display.max_colwidth = 100
file =  open("data.txt", 'r')

chrA =[]
chrB = []
Bgenes = []

for line in file.readlines():
    chrA.append(line.split()[0])
    chrB.append(line.split()[5])
    for pos in chrB:
        if pos in chrA: 
            Bgenes.append(line)

why do you import `pandas` if the data doesn't even go in to a dataframe? — gold_cy, Apr 19 '19 at 14:36
also isn't every element from the 6th column in the 1st column? — gold_cy, Apr 19 '19 at 14:42

score 2 · Answer 1 · answered Apr 19 '19 at 15:06

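You can also use a list comprehension to find matches:

with open('data.txt', 'r') as f:
    # Split every non-empty line into its whitespace-separated fields
    lines = [line.split() for line in f if line.strip()]

# Column 1 values, built once instead of on every iteration
first_col = [x[0] for x in lines]

for line in lines:
    try:
        i = first_col.index(line[5])
        print(' '.join(lines[i]))
    except ValueError:
        # list.index raises ValueError when the value is not found
        pass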

Output:

chr2:164084669 ]chr1:92092600]G ENSG00000237844 ENST00000429636 AC016766.1 chr1:92092600
chr9:2781522 ]chr1:121498879]T ENSG00000080608 ENST00000490444 PUM3 chr1:121498879
chr3:196435392 A[chr2:101298260[ ENSG00000163960 ENST00000296328 UBXN7 chr2:101298260
chr1:92092600 G[chr2:164084669[ ENSG00000189195 ENST00000342818 BTBD8 chr2:164084669
chr1:121498879 T[chr9:2781522[ ENSG00000233432 ENST00000425455 AL592494.2 chr9:2781522
chr2:101298260 ]chr3:196435392]A ENSG00000163162 ENST00000295317 RNF149 chr3:196435392

Great, thanks @Alderven! One further question, for each element in column 6 that is not present in column 1, could you add a blank row? — lindak, Apr 19 '19 at 21:18
Or, maybe that's not necessary...I want to merge the output back to the indata and maybe that can be accomplished with pandas concat, so yielding NaNs instead of a blank line. — lindak, Apr 19 '19 at 22:15
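
For the follow-up about blank rows, here is a minimal sketch. It indexes column 1 in a dict and emits an empty string whenever a column 6 value has no match. The helper name `match_rows` and the sample rows are hypothetical; the last sample row is invented purely to show a miss.

```python
def match_rows(rows):
    """For each row, return the row whose column 1 equals this row's column 6.

    Rows are lists of fields; a missing match yields an empty string.
    """
    # Map each column 1 value to the first row that carries it
    by_first = {}
    for row in rows:
        by_first.setdefault(row[0], row)

    out = []
    for row in rows:
        hit = by_first.get(row[5])
        out.append(" ".join(hit) if hit is not None else "")
    return out


# Hypothetical sample: two rows from the question plus one invented row
# whose column 6 value does not occur in column 1
sample = [
    "chr1:92092600 G[chr2:164084669[ ENSG00000189195 ENST00000342818 BTBD8 chr2:164084669".split(),
    "chr2:164084669 ]chr1:92092600]G ENSG00000237844 ENST00000429636 AC016766.1 chr1:92092600".split(),
    "chrX:1 N[chrY:2[ GENE_ID TRANSCRIPT_ID NAME chrY:2".split(),
]

result = match_rows(sample)
for text in result:
    print(text)
```

Because the output list has one entry per input row, it can be lined up with the input afterwards; whether a blank line or some other placeholder is preferable depends on what the later merge step expects.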

score 1 · Answer 2 · answered Apr 19 '19 at 15:00

First put your data in a pandas DataFrame, then you can use this:

import pandas as pd

df = pd.DataFrame({"a": ["asdf", "qwer", "zxcv"],
                   "b": ["b_row_1", "b_row_2", "b_row_3"],
                   "c": ["ghjk", "qwer", "zxcv"]})

for index, row in df.iterrows():
    if row["c"] not in df["a"].tolist():
        df = df.drop(index)

The output should look like this:

      a        b     c
1  qwer  b_row_2  qwer
2  zxcv  b_row_3  zxcv

You can use something like this to read your file as a pandas DataFrame:

data = pd.read_csv('output_list.txt', sep=r"\s+", header=None)  # columns are separated by runs of whitespace
data.columns = ["a", "b", "c", "etc."]

Check these links:

Load data from txt with pandas

How to iterate over rows in a dataframe in pandas

Pandas dataframe drop

don't use `iterrows` since that defeats the purpose of `pandas` — gold_cy, Apr 19 '19 at 15:04
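
Following the comment above, here is a sketch of the same filter without `iterrows`, using a boolean mask built with `Series.isin`. It reuses the toy frame from this answer; the name `kept` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": ["asdf", "qwer", "zxcv"],
                   "b": ["b_row_1", "b_row_2", "b_row_3"],
                   "c": ["ghjk", "qwer", "zxcv"]})

# Keep rows whose "c" value occurs anywhere in column "a"
kept = df[df["c"].isin(df["a"])]
print(kept)
```

This keeps the same rows as the loop-and-drop version, but the membership test runs once over the whole column instead of once per row.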

score 0 · Answer 3 · answered Apr 19 '19 at 14:51 (edited Apr 19 '19 at 15:23) · Sainath Motlakunta

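You need one step that collects all the lines and a separate, nested loop that searches them. Your code searches while the lists are still being filled.

with open("data.txt") as file:
    lines = file.readlines()

Bgenes = []
for line in lines:           # take column 6 of each line...
    for line2 in lines:      # ...and compare it with column 1 of every line
        if line.split()[5] == line2.split()[0]:
            Bgenes.append(line2)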

I hope this helps :)


score 0 · Answer 4 · answered Apr 19 '19 at 17:06

I assumed that your data is comma-separated (you can add the commas yourself), because the original data has varying amounts of whitespace between columns. Here is the code; the result should be what you want.

import pandas as pd
data1 = pd.read_csv('C:/data.csv', sep=',', header=None)
data2 = pd.read_csv('C:/data.csv', sep=',', header=None)
df1 = pd.DataFrame(data1)  # create the first dataframe
df2 = pd.DataFrame(data2)  # create the second dataframe

df1.columns = ['1', '2', '3', '4', '5', 'ID']  # assign ID to column 6
df2.columns = ['ID', '2', '3', '4', '5', '6']  # assign ID to column 1

dfMerged1=pd.merge(df1, df2, on='ID', how='inner') 
dfMerged2=pd.merge(df2, dfMerged1, on='ID', how='inner')

dfCleaned=dfMerged2.iloc[:,0:6] #what you want at the end
print(dfCleaned)
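
A single left merge may be enough if the goal is to line each input row up with its match and get NaN where there is none, as raised in the comments under Answer 1. The sketch below assumes whitespace-separated input read with `sep=r"\s+"`; the column names and the inline sample (including the unmatched last row) are invented for illustration.

```python
import io
import pandas as pd

# Inline stand-in for the data file; the last row is invented so that
# its column 6 value has no partner in column 1
text = """\
chr1:92092600   G[chr2:164084669[   ENSG00000189195 ENST00000342818 BTBD8   chr2:164084669
chr2:164084669  ]chr1:92092600]G    ENSG00000237844 ENST00000429636 AC016766.1  chr1:92092600
chrX:1  N[chrY:2[   GENE_ID TRANSCRIPT_ID   NAME    chrY:2
"""

# Hypothetical column names; the file itself has no header
cols = ["pos", "alt", "gene_id", "transcript_id", "gene", "mate"]
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None, names=cols)

# Left merge: each row's "mate" is looked up in "pos" of the same table;
# columns coming from the matched row get the "_match" suffix
merged = df.merge(df, how="left", left_on="mate", right_on="pos",
                  suffixes=("", "_match"))
print(merged[["pos", "mate", "pos_match", "gene_match"]])
```

With `how="left"` every input row is kept, so the result stays aligned with the original data and unmatched rows simply carry NaN in the `_match` columns.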
