How to compare a list to a column in a data frame and print all rows where the list matches the column

Question

I have a data frame where one of the columns lists the gene my genetic mutations are associated with (last column).

0    chr1    6667742        T  TTC          HIGH             frameshift_variant     DNAJC11
1    chr1    8360467        G   GC          HIGH             frameshift_variant        RERE
2    chr1   10658519        T    A      MODERATE               missense_variant       CASZ1
3    chr1   12892965        T    G      MODERATE               missense_variant    PRAMEF10
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN
..    ...        ...      ...  ...           ...                            ...         ...
443  chrX  131273813        G    C      MODERATE               missense_variant       IGSF1
444  chrX  141003622        A    G      MODERATE               missense_variant     SPANXB1
445  chrX  152919025  CGAGGAG    C      MODERATE    disruptive_inframe_deletion      ZNF185
446  chrX  152919025  CGAGGAG    C      MODERATE               sequence_feature      ZNF185
447  chrY   12722134       CA    C          HIGH             frameshift_variant       USP9Y

I also have a list of genes that I want to see if my data frame contains. I have been able to compare my list of genes to my data frame and print the genes that matched. However, what I am trying to do now is have the script print out the entire row where a match occurs so that I have all the information associated with that match.

I isolated the column containing the gene associated with each genetic mutation with:

gene_column=data_frame.iloc[:,6]

And compared that to the list of genes I am interested in, which I inputted from a txt file.

with open(r'E:\bcf_analysis\gene_list\met_associated_genes_new_line.txt', "r") as genes_of_interest_txt:  # my list of genes, one per line
    genes_of_interest = genes_of_interest_txt.read()  # reads the text file
genes_of_interest_list = genes_of_interest.split("\n")  # turns the text into a list

I then found all the matches using these nested for loops.

for i in genes_of_interest_list:
    for num in gene_column:
        if num == i:
            print(num)  # prints each gene that matched

Now I am trying to figure out how to print the whole row associated with each match. My idea is to build a flag array that marks the rows where there is a match, then select all flagged rows and write them to a new .csv file.


length_of_dataframe = 449
match_flag = np.zeros((length_of_file, 1), dtype=int, order='C')



num = int(0)


for i in genes_of_interest_list: 
    for num in gene_column:
        if num == i : 
            match_flag[num]= 1
            
print (match_flag)

I am getting the following error.

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I am very new to coding, so if you have a better method please let me know.

NOTE: I am using the numpy and pandas libraries.

thamuppet · Answer 1 · 2022-05-24T08:01:07.267

I'm not sure I follow your request, but do you mean something like this? Note that this requires converting your data frame to text, which may not be an option.

Code:

text = '''0    chr1    6667742        T  TTC          HIGH             frameshift_variant     DNAJC11
1    chr1    8360467        G   GC          HIGH             frameshift_variant        RERE
2    chr1   10658519        T    A      MODERATE               missense_variant       CASZ1
3    chr1   12892965        T    G      MODERATE               missense_variant    PRAMEF10
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN
..    ...        ...      ...  ...           ...                            ...         ...
443  chrX  131273813        G    C      MODERATE               missense_variant       IGSF1
444  chrX  141003622        A    G      MODERATE               missense_variant     SPANXB1
445  chrX  152919025  CGAGGAG    C      MODERATE    disruptive_inframe_deletion      ZNF185
446  chrX  152919025  CGAGGAG    C      MODERATE               sequence_feature      ZNF185
447  chrY   12722134       CA    C          HIGH             frameshift_variant       USP9Y'''

split_text = text.split('\n') #split by rows

print('first example:')
for line in split_text:
    if "KAZN" in line:
        print(line)

print('\n')     
check_this = ['KAZN', 'ZNF185']

print('second example:')
for line in split_text:
    if any(x in line for x in check_this):
        print(line)

Output:

first example:
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN

second example:
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN
445  chrX  152919025  CGAGGAG    C      MODERATE    disruptive_inframe_deletion      ZNF185
446  chrX  152919025  CGAGGAG    C      MODERATE               sequence_feature      ZNF185

[Program finished]

The first example looks for a single phrase in each line and prints the line when it finds a match.

The second example does the same for a list of phrases, printing any line that contains at least one of them.

Hi, thank you, I think this will work. I am running into an issue, though, on the following line: `data_frame_split_by_row = data_frame.split('\n')`. I am getting the error: `File "C:\Users\kcsha\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\generic.py", line 5274, in __getattr__ return object.__getattribute__(self, name) AttributeError: 'DataFrame' object has no attribute 'split'`. Is this because I am missing a library, or is it because my data frame is not a text file? — kshaff, May 24 '22 at 17:28
For this I think you want to look up `df.to_string()`: https://stackoverflow.com/questions/31247198/python-pandas-write-content-of-dataframe-into-text-file — thamuppet, May 24 '22 at 18:33
That worked, thank you @thamuppet, but now my script just prints out my whole data frame. Here is my code (sorry, I don't know how to format it here): `data_frame_as_string = data_frame.to_string(header=True, index=False) data_frame_split_by_row = data_frame_as_string.split('\n') print('Found Genes:') for line in data_frame_split_by_row: if any(x in line for x in genes_of_interest_list): print (line)` — kshaff, May 24 '22 at 19:06
It could be that there is only one line after converting to a string. Read more about converting a data frame to text/string. Sorry I can't be of more help. — thamuppet, May 26 '22 at 10:04
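
One possible cause of every row being printed, offered as an assumption because the gene file itself is not shown: if the text file ends with a newline, `read().split("\n")` leaves an empty string at the end of the list, and the empty string is a substring of every line. The sketch below uses made-up file contents and rows (`file_contents` and `rows` are hypothetical names) to show the effect and one way to avoid it.

```python
# Hypothetical file contents: two gene names, with a trailing newline.
file_contents = "KAZN\nZNF185\n"

# split("\n") keeps the empty string that follows the final newline.
genes_with_blank = file_contents.split("\n")
print(genes_with_blank)  # ['KAZN', 'ZNF185', '']

# Dropping blank entries (and stray whitespace) avoids that.
genes_clean = [g.strip() for g in file_contents.splitlines() if g.strip()]

# Made-up rows standing in for the lines of data_frame.to_string().
rows = [
    "chr1  6667742  T  TTC  HIGH  frameshift_variant  DNAJC11",
    "chr1  14599118  AGCG  A  MODERATE  conservative_inframe_deletion  KAZN",
]

# "" is a substring of every line, so every row counts as a match.
matches_with_blank = [r for r in rows if any(g in r for g in genes_with_blank)]

# Comparing whole whitespace-separated fields also avoids partial-name hits.
matches_clean = [r for r in rows if any(g in r.split() for g in genes_clean)]

print(len(matches_with_blank))  # 2
print(len(matches_clean))       # 1
```

If the list read from the file does contain a blank entry, filtering it out before the loop should be enough to stop the whole data frame from being printed.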

score 0 · Answer 2 · answered May 24 '22 at 07:42

If I'm not mistaken, you just want to save the rows of the data frame that contain the matched genes to a csv file? In that case, you can first use a list comprehension to build a list of matched genes, then use that list to look up the rows in your data frame.

matched_list = [num for i in genes_of_interest_list for num in gene_column if num == i]
# `num` and `i` are equal whenever there is a match, so either one can be collected here

new_df = data_frame[data_frame['Your last column name'].isin(matched_list)]

new_df.to_csv("some_file_name.csv", index=False)
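
The intermediate list is not strictly needed, since `isin` can take the list of genes directly. The resulting boolean mask also plays the role of the flag array described in the question, where the IndexError arises because `match_flag[num]` indexes a NumPy array with a gene-name string instead of an integer position. Below is a minimal sketch on a small made-up frame; the column names (`chrom`, `pos`, `effect`, `gene`) are assumptions, not the real ones.

```python
import pandas as pd

# Small stand-in for the real data; the column names are assumptions.
data_frame = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chrX", "chrX"],
    "pos": [6667742, 14599118, 152919025, 152919025],
    "effect": ["frameshift_variant", "conservative_inframe_deletion",
               "disruptive_inframe_deletion", "sequence_feature"],
    "gene": ["DNAJC11", "KAZN", "ZNF185", "ZNF185"],
})
genes_of_interest_list = ["KAZN", "ZNF185"]

# Boolean mask: True where the gene column holds one of the listed genes.
match_flag = data_frame["gene"].isin(genes_of_interest_list)

# Selecting with the mask returns the full rows for every match.
matched_rows = data_frame[match_flag]
print(matched_rows)

# With no path, to_csv returns the CSV text; pass a file name to write a file.
csv_text = matched_rows.to_csv(index=False)
print(csv_text)
```

If the gene column is only known by position, as in the question, `data_frame.iloc[:, 6].isin(genes_of_interest_list)` should give the same kind of mask.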
