Tokenize a list

Question

I like to tokenize a list using a list item as a delimiter.

Is there is a pythonic way to do this or do I have to write something on my own.

Data=['Label',23,'NORM','|','RESP',1.256,None,'|','','','|','RELV','','']
SubList = TokenizeList (Data,Delim='|')

printing SubList would result in

[ ['Label',23,'NORM'] , ['RESP',1.256,None] , ['',''] , ['RELV','',''] ]

You mean the delimiter or the list items? The original list can contain anything, the delimite is a character like in .split() — , Dec 06 '19 at 11:46
Yes, they probably can, but a space (or a number of spaces) is here like having an empty string, so I can replace them in advance — , Dec 06 '19 at 11:48
Can they contain only numbers and string or other objects as well? — Riccardo Bucco, Dec 06 '19 at 11:49
Does this answer your question? [Python splitting a list based on a delimiter word](https://stackoverflow.com/questions/15357830/python-splitting-a-list-based-on-a-delimiter-word) — anupsabraham, Dec 06 '19 at 11:54

Sayandip Dutta · Answer 1 · 2019-12-06T11:58:59.123

3

Yes, you can use itertools.groupby:

>>> from itertools import groupby
>>> Data=['Label',23,'NORM','|','RESP',1.256,None,'|','','','|','RELV','','']
>>> [list(g) for k,g in groupby(Data,key=lambda x:x == '|') if not k]
[['Label', 23, 'NORM'], ['RESP', 1.256, None], ['', ''], ['RELV', '', '']]

You can make a function of course:

def splitList(sequence, delimiter):
    return [list(g) for k, g in groupby(sequence, key = lambda x: x == delimiter) if not k]
>>> splitList(sequence = Data, delimiter = '|')
[['Label', 23, 'NORM'], ['RESP', 1.256, None], ['', ''], ['RELV', '', '']]

edited Dec 06 '19 at 11:58

answered Dec 06 '19 at 11:53

Sayandip Dutta

15,602
4
23
52

1

This answer should be accepted because it's the only one which is really *pythonic*, as asked by the user – Riccardo Bucco Dec 06 '19 at 11:55
1

While you can use `itertools.groupby()` for this, it may not be the most efficient approach here – norok2 Dec 06 '19 at 12:04
It is also behaving differently (and it is hard to adapt) if two consecutive separators are found. – norok2 Dec 06 '19 at 12:27

norok2 · Accepted Answer · 2019-12-06T12:26:07.203

Try this:

def group_by_sep(items, sep='|'):
    inner_list = []
    for item in items:
        if item == sep:
            yield inner_list
            inner_list = []
        else:
            inner_list.append(item)
    if inner_list:
        yield inner_list


Data=['Label',23,'NORM','|','RESP',1.256,None,'|','','','|','RELV','','','|','|','now','|']

SubList = list(group_by_sep(Data, '|'))
print(SubList)
# [['Label', 23, 'NORM'], ['RESP', 1.256, None], ['', ''], ['RELV', '', ''], [], ['now']]

Note that a itertools.groupby approach can be used here, but it is not equivalent to the above and offers less control over the exact behavior:

import itertools


def group_by_sep2(items, sep='|'):
    yield from (
        list(g)
        for k, g in itertools.groupby(items, key=lambda x: x == sep)
        if not k)


SubList2 = list(group_by_sep2(Data, '|'))
print(SubList2)
# [['Label', 23, 'NORM'], ['RESP', 1.256, None], ['', ''], ['RELV', '', ''], ['now']]

It is missing the empty list between two consecutive separators.

Additionally, it is not as efficient as the direct method from above:

%timeit list(group_by_sep(Data))
# 1000 loops, best of 3: 1.47 µs per loop
%timeit list(group_by_sep2(Data))
# 100 loops, best of 3: 4.01 µs per loop

%timeit list(group_by_sep(Data * 1000))
# 1000 loops, best of 3: 1.33 ms per loop
%timeit list(group_by_sep2(Data * 1000))
# 100 loops, best of 3: 2.83 ms per loop

%timeit list(group_by_sep(Data * 1000000))
# 1000 loops, best of 3: 1.67 s per loop
%timeit list(group_by_sep2(Data * 1000000))
# 100 loops, best of 3: 3.22 s per loop

And the benchmarks indicate that the direct approach is ~2x to ~3x faster.

(EDITED to write it all as generators and included more edge cases)

@shaikmoeed there isn't much else you can write to solve the problem ;-) — norok2, Dec 06 '19 at 12:12
Our approach matches one of the `zen of python` rules I,e., *Explicit is better than implicit*. (+1) for adding timings. — shaik moeed, Dec 06 '19 at 12:28

shaik moeed · Answer 3 · 2019-12-18T11:44:13.487

1

Try this, which is simple and straight (Pythonic as well),

def tokenize_list(array, sep='|'):
    result = []
    _temp = []
    for el in array:
        if el == sep:
            result.append(_temp)
            _temp = []
        else:
            _temp.append(el)
    if _temp: # Finally append list after for-loop, to store last vlaues present in _temp if exists.
        result.append(_temp) 
    return result

Output:

>>> data = ['Label',23,'NORM','|','RESP',1.256,None,'|','','','|','RELV','','', '|']
>>> tokenize_list(data)
[['Label', 23, 'NORM'], ['RESP', 1.256, None], ['', ''], ['RELV', '', '']]

edited Dec 18 '19 at 11:44

answered Dec 06 '19 at 11:48

shaik moeed

5,300
1
18
54

if `Data` ends with `|` this will not work correctly (adds last inner `list` twice) – norok2 Dec 06 '19 at 11:56
@norok2 Thanks. Edited the answer, to handle that case. – shaik moeed Dec 06 '19 at 12:04

Tokenize a list

3 Answers3

Linked

Related