header =  ['chr', 'pos', 'ms01e_PI', 'ms01e_PG_al', 'ms02g_PI', 'ms02g_PG_al', 'ms03g_PI', 'ms03g_PG_al', 'ms04h_PI', 'ms04h_PG_al']

I want to convert the elements of the above list into a list of tuples, like:

sample_list = [('ms01e_PI', 'ms01e_PG_al'), ('ms02g_PI', 'ms02g_PG_al'),
              ('ms03g_PI', 'ms03g_PG_al'), ('ms04h_PI', 'ms04h_PG_al')]

I am thinking a lambda or a list comprehension could be used to do this in a short, readable way.

sample_list = [lambda (x,y): x = a if '_PI' in a for a in header ..]

or,

[(x, y) if '_PI' and '_PG_al' in a for a in header]

Any suggestions?

pault
everestial007
  • Seems like you want pairs of consecutive elements. If so, this is a perfect use case for [`zip()`](https://stackoverflow.com/questions/13704860/zip-lists-in-python): First remove the first 2 elements: `header = header[2:]` and then do `zip(header[::2], header[1::2])`. See also: [Understanding python's slice notation](https://stackoverflow.com/questions/509211/understanding-pythons-slice-notation). – pault Feb 09 '18 at 18:59
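
The suggestion in the comment above can be sketched as follows. It assumes the first two entries are always `chr` and `pos` and that the remaining columns already appear as adjacent `_PI`/`_PG_al` pairs; the name `columns` is introduced here only for illustration. In Python 3 `zip()` returns an iterator, so it is wrapped in `list()`.

```python
header = ['chr', 'pos', 'ms01e_PI', 'ms01e_PG_al', 'ms02g_PI', 'ms02g_PG_al',
          'ms03g_PI', 'ms03g_PG_al', 'ms04h_PI', 'ms04h_PG_al']

# Drop the two leading non-sample columns (assumed to be 'chr' and 'pos').
columns = header[2:]

# Pair each even-indexed entry with the odd-indexed entry that follows it.
sample_list = list(zip(columns[::2], columns[1::2]))
print(sample_list)
```

This relies purely on position, so it will pair the wrong columns if the header is not ordered by sample.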

4 Answers


You can filter the list and remove all elements that do not match the desired grouping pattern:

import re
header =  ['chr', 'pos', 'ms01e', 'ms01e_PG_al', 'ms01e_PI', 'ms01e_PG_al', 'ms02g_PI', 'ms02g_PG_al', 'ms03g_PI', 'ms03g_PG_al', 'ms04h_PI', 'ms04h_PG_al']
# Keep names such as 'ms01e' or 'ms01e_PI'; 'chr' and 'pos' match neither pattern.
new_headers = list(filter(lambda x: re.findall(r'^[a-zA-Z]+_[a-zA-Z]+|[a-zA-Z]+\d+[a-zA-Z]+', x), header))
final_data = [(new_headers[i], new_headers[i+1]) for i in range(0, len(new_headers), 2)]

Output:

[('ms01e', 'ms01e_PG_al'), ('ms01e_PI', 'ms01e_PG_al'), ('ms02g_PI', 'ms02g_PG_al'), ('ms03g_PI', 'ms03g_PG_al'), ('ms04h_PI', 'ms04h_PG_al')]
Ajax1234
  • I totally like this idea. I had used `list comprehension` before but totally missed that I could step by two to work out the problem. Also, is there a way to do this by matching the `sample name` before `_`, if the situation arises? – everestial007 Feb 09 '18 at 19:34
  • @everestial007 Thank you. However, I am slightly confused. What do you mean by `sample name`? – Ajax1234 Feb 09 '18 at 19:44
  • The sample names are `ms01e, ms02g...` and each sample has two values `ms01e_PI, ms01e_PG_al`. So, after removing `chr and pos`, all samples should have values in pairs. I am wondering if there is a way to mine the unique samples from the data by matching the `sample name before underscore _` and then pairing `PI and PG after underscore _ `. The final output is the same `list of tuples` but the way is different. This would be important if the data is not ordered by names. – everestial007 Feb 09 '18 at 19:47
  • @everestial007 Please see my recent edit. I modified the solution so that `header` elements similar to `ms01e` will be contained by the `filter` function. – Ajax1234 Feb 09 '18 at 19:53
  • The part of the code that needs changing is `final_data = [(new_headers[i], new_headers[i+1]) for i in range(0, len(new_headers), 2)]`, because if the list is unordered it would create a problem. So, finding the unique names before `_` and then building the `[()]` would be a better approach. – everestial007 Feb 09 '18 at 19:58
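
The approach raised in the comments above, grouping columns by the sample name before the first underscore rather than by position, could look roughly like this. It is a minimal sketch, not the answerer's code: the shuffled `header` and the name `columns_by_sample` are made up for illustration, and it assumes every sample column has the form `<sample>_PI` or `<sample>_PG_al`.

```python
from collections import defaultdict

# Hypothetical shuffled header, used only for illustration.
header = ['chr', 'ms02g_PG_al', 'pos', 'ms01e_PI', 'ms02g_PI', 'ms01e_PG_al']

# Map each sample name (the text before the first underscore) to its columns.
columns_by_sample = defaultdict(dict)
for name in header:
    sample, sep, suffix = name.partition('_')
    if sep:  # skips 'chr' and 'pos', which contain no underscore
        columns_by_sample[sample][suffix] = name

# Build (PI, PG_al) tuples, in sample-name order, for samples that have both columns.
sample_list = [(cols['PI'], cols['PG_al'])
               for sample, cols in sorted(columns_by_sample.items())
               if 'PI' in cols and 'PG_al' in cols]
print(sample_list)
```

Because the pairing is keyed on the sample name, the order of the columns in `header` no longer matters, and a sample missing one of its two columns is skipped instead of shifting every later pair.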

Try this:

header = ['chr', 'pos', 'ms01e_PI', 'ms01e_PG_al', 'ms02g_PI', 'ms02g_PG_al', 'ms03g_PI', 'ms03g_PG_al', 'ms04h_PI', 'ms04h_PG_al']


def l_tuple(items):
    # Keep only the sample columns (drops 'chr' and 'pos').
    items = filter(lambda x: "PI" in x or "PG" in x, items)
    # Stable sort on the 4-character prefix groups each sample's columns together.
    l = sorted(items, key=lambda x: x[:4])
    return [(l[i], l[i + 1]) for i in range(0, len(l), 2)]

print(l_tuple(header))

Output

[('ms01e_PI', 'ms01e_PG_al'), ('ms02g_PI', 'ms02g_PG_al'), ('ms03g_PI', 'ms03g_PG_al'), ('ms04h_PI', 'ms04h_PG_al')]
CrizR

This is one way:

# first, filter and sort
header = sorted(i for i in header if any(k in i for k in ('_PI', '_PG_al')))

# second, zip and order by suffix
header = [(x, y) if '_PI' in x else (y, x) for x, y in zip(header[::2], header[1::2])]

# [('ms01e_PI', 'ms01e_PG_al'),
#  ('ms02g_PI', 'ms02g_PG_al'),
#  ('ms03g_PI', 'ms03g_PG_al'),
#  ('ms04h_PI', 'ms04h_PG_al')]
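
To see why the second step swaps some pairs, the same two steps can be run on a shuffled header. The input below and the names `kept` and `pairs` are hypothetical, chosen only for illustration. A plain lexicographic sort puts each sample's `_PG_al` column before its `_PI` column, since `'G'` sorts before `'I'`.

```python
# Hypothetical shuffled header, for illustration only.
header = ['ms02g_PI', 'chr', 'ms01e_PG_al', 'pos', 'ms01e_PI', 'ms02g_PG_al']

# First, filter and sort: each sample's two columns end up adjacent.
kept = sorted(i for i in header if any(k in i for k in ('_PI', '_PG_al')))
print(kept)

# Second, zip neighbours and put the '_PI' column first in each tuple.
pairs = [(x, y) if '_PI' in x else (y, x) for x, y in zip(kept[::2], kept[1::2])]
print(pairs)
```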
jpp

I had a concern that the input header may not have the samples (PI and PG values) ordered. I think it would be better to mine the sample names first and then create the list of tuples in the following manner.

header =  ['chr', 'pos', 'ms01e_PI', 'ms01e_PG_al', 'ms02g_PI', 'ms02g_PG_al', 'ms03g_PI', 'ms03g_PG_al', 'ms04h_PI', 'ms04h_PG_al']

# Keep the sample names: skip 'chr' and 'pos' (no underscore) and
# strip everything from the first underscore onward.
samples = [x.split('_')[0] for x in header if '_' in x]

# Reduce to unique names. A plain set would lose the order, so track
# the names already seen while building an order-preserving list.
seen = set()
sample_set = [x for x in samples if not (x in seen or seen.add(x))]

# Now, create the list of tuples.
sample_list = [((x + '_PI'), (x + '_PG_al')) for x in sample_set]
print('sample list: ', sample_list)

sample list:  [('ms01e_PI', 'ms01e_PG_al'), ('ms02g_PI', 'ms02g_PG_al'), ('ms03g_PI', 'ms03g_PG_al'), ('ms04h_PI', 'ms04h_PG_al')]
everestial007