
I am a beginner/intermediate Python user, and whenever I write code that is elaborate (at least for me), I try to rewrite it to reduce the number of lines where possible.

Here is the code I have written.

It reads all the values of a data frame looking for a specific string. If the string is found, it saves the index and value in a dictionary and drops the rows where the string was found. Then it does the same with the next string, and so on.

##### Reading CSV file values and looking for variant IDs ######

# Find Variant ID (rs000000) in CSV
# \d+ is necessary so that only "rs" followed by digits matches, not "rs" followed by anything else
rs = df_draft[df_draft.apply(lambda x: x.str.contains(r"rs\d+"))].dropna(how='all').dropna(axis=1, how='all')

# Now we save the results in a dict: key = row index, value = variant ID
if not rs.empty:
    ind = rs.index.to_list()
    vals = list(rs.stack().values)
    row2rs = dict(zip(ind, vals))
    print(row2rs)

    # Rows where an rs ID was found must be dropped from df_draft.
    # Otherwise, a row holding more than one variant ID (e.g. rs# and NM_#)
    # would be picked up again by the later searches.
    for index in row2rs:
        df_draft = df_draft.drop(index)
    
## Same thing with other ID variants

# Here with Variant ID (NM_0000000) in CSV
NM = df_draft[df_draft.apply(lambda x: x.str.contains(r"NM_\d+"))].dropna(how='all').dropna(axis=1, how='all')

if not NM.empty:
    ind = NM.index.to_list()
    vals = list(NM.stack().values)
    row2NM = dict(zip(ind, vals))
    print(row2NM)

    for index in row2NM:
        df_draft = df_draft.drop(index)

# Here with Variant ID (NP_0000000) in CSV
NP = df_draft[df_draft.apply(lambda x: x.str.contains(r"NP_\d+"))].dropna(how='all').dropna(axis=1, how='all')

if not NP.empty:
    ind = NP.index.to_list()
    vals = list(NP.stack().values)
    row2NP = dict(zip(ind, vals))
    print(row2NP)

    for index in row2NP:
        df_draft = df_draft.drop(index)

# Here with ClinVar field (RCV#) in CSV
RCV = df_draft[df_draft.apply(lambda x: x.str.contains(r"RCV\d+"))].dropna(how='all').dropna(axis=1, how='all')

if not RCV.empty:
    ind = RCV.index.to_list()
    vals = list(RCV.stack().values)
    row2RCV = dict(zip(ind, vals))
    print(row2RCV)

    # Drop the rows found for RCV (the original looped over row2NP here by mistake)
    for index in row2RCV:
        df_draft = df_draft.drop(index)

I was wondering whether there is a more elegant way to write this simple but long code. I have been thinking of sa

  • just use `str.contains()` to create a boolean index. Then keep/drop whatever you want/don't want – noah Dec 01 '20 at 22:38
  • Similar to my answer for [this question](https://stackoverflow.com/questions/65081257/how-can-i-remove-a-substring-from-a-given-string-using-pandas/65081427#65081427). Not the best worded question but the answer should more or less have what you need – noah Dec 01 '20 at 22:39
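
The boolean-index idea from the comments could be sketched roughly as below. This is only a minimal sketch under some assumptions: the sample `df_draft`, the `patterns` dict and the helper name `extract_ids` are all made up for illustration, the row index is assumed to be unique, and only the first matching cell per row is kept. It loops over the patterns instead of repeating the same block four times.

```python
import pandas as pd

# Hypothetical sample data standing in for df_draft (not from the question)
df_draft = pd.DataFrame(
    {
        "col_a": ["rs12345", "gene X", "NM_000546 and rs999", "nothing here", "plain"],
        "col_b": ["note", "NP_000537", "text", "RCV000012345", "text"],
    }
)

# Patterns in the same order as in the question; dicts keep insertion order
patterns = {"rs": r"rs\d+", "NM": r"NM_\d+", "NP": r"NP_\d+", "RCV": r"RCV\d+"}


def extract_ids(df, patterns):
    """Return ({name: {row index: first matching cell}}, rows with no match)."""
    found = {}
    remaining = df
    for name, pattern in patterns.items():
        # Boolean frame: True where a cell contains the pattern
        mask = remaining.apply(lambda col: col.astype(str).str.contains(pattern))
        # Boolean Series: True for rows with at least one matching cell
        hit_rows = mask.any(axis=1)
        # For each matching row, keep the first cell that matched
        found[name] = {
            idx: remaining.loc[idx][mask.loc[idx]].iloc[0]
            for idx in remaining.index[hit_rows.to_numpy()]
        }
        # Drop matched rows so later patterns do not pick them up again
        remaining = remaining[~hit_rows]
    return found, remaining


found, remaining = extract_ids(df_draft, patterns)
print(found)
print(remaining)
```

Adding another ID type would then only mean adding one entry to `patterns`. As in the question, the stored value is the whole cell, not just the matched ID; `str.extract` might be worth a look if only the ID itself is needed.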
