
I have a list of tuples that contains strings. For instance:

[('this', 'is', 'a', 'foo', 'bar', 'sentences'),
 ('is', 'a', 'foo', 'bar', 'sentences', 'and'),
 ('a', 'foo', 'bar', 'sentences', 'and', 'i'),
 ('foo', 'bar', 'sentences', 'and', 'i', 'want'),
 ('bar', 'sentences', 'and', 'i', 'want', 'to'),
 ('sentences', 'and', 'i', 'want', 'to', 'ngramize'),
 ('and', 'i', 'want', 'to', 'ngramize', 'it')]

Now I wish to concatenate the strings in each tuple to create a list of space-separated strings. I used the following method:

NewData = []
for grams in sixgrams:
    NewData.append((''.join([w + ' ' for w in grams])).strip())

which is working perfectly fine.

However, the list that I have has over a million tuples. So my question is: is this method efficient enough, or is there a better way to do it? Thanks.

alphacentauri
  • There is no real efficiency issue here, but the code can be written much more neatly. Aside from that, the way `.join` is being used here (a) misses the point of the syntax: if you want `' '` to go in between the elements, then that is what should be on the left-hand side of `.join`; and (b) adds a trailing space that then requires extra work to remove. The linked duplicates explain the right way to join the individual tuples as well as the options for repeating the process across the list. – Karl Knechtel Mar 07 '23 at 20:21
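
A minimal sketch of the point in the comment above, using one tuple from the question as the assumed input:

grams = ('this', 'is', 'a', 'foo', 'bar', 'sentences')

# Original approach: a space is appended to every word, so the result has a
# trailing space that has to be stripped off afterwards.
clumsy = (''.join([w + ' ' for w in grams])).strip()

# Idiomatic approach: the separator goes on the left-hand side of .join and
# is placed only between the elements, so no strip() is needed.
neat = ' '.join(grams)

print(clumsy == neat)  # True
print(neat)            # this is a foo bar sentences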

3 Answers


For a lot of data, you should consider whether you need to keep it all in a list. If you are processing each one at a time, you can create a generator that will yield each joined string, but won't keep them all around taking up memory:

new_data = (' '.join(w) for w in sixgrams)

If you can also get the original tuples from a generator, then you can avoid having the sixgrams list in memory as well.
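
A minimal sketch of that idea, assuming the six-grams are produced lazily from a flat list of tokens (the tokens list and the sixgram_tuples helper are illustrative, not part of the answer):

tokens = ['this', 'is', 'a', 'foo', 'bar', 'sentences',
          'and', 'i', 'want', 'to', 'ngramize', 'it']

def sixgram_tuples(words, n=6):
    # Yield one n-gram tuple at a time instead of building a full list of them.
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

# Chain the two generators: neither the tuples nor the joined strings
# are ever held in memory all at once.
joined = (' '.join(gram) for gram in sixgram_tuples(tokens))

for sentence in joined:
    print(sentence)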

lvc
  • 34,233
  • 10
  • 73
  • 98
  • and how do you get each item from this generator when you added an if condition inside? – mrbTT Jun 21 '22 at 22:56
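
A minimal sketch of how such a generator is consumed, with a small sixgrams list and a hypothetical filter condition standing in for the one the comment mentions:

sixgrams = [('this', 'is', 'a', 'foo', 'bar', 'sentences'),
            ('and', 'i', 'want', 'to', 'ngramize', 'it')]

# The generator is evaluated lazily; items come out as you iterate over it.
new_data = (' '.join(w) for w in sixgrams if 'foo' in w)  # hypothetical condition

for sentence in new_data:   # each item is produced only when the loop asks for it
    print(sentence)

# Alternatively, pull items one at a time with next():
gen = (' '.join(w) for w in sixgrams)
print(next(gen))            # 'this is a foo bar sentences'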

The list comprehension creates temporary strings. Just use ' '.join instead.

>>> words_list = [('this', 'is', 'a', 'foo', 'bar', 'sentences'),
...               ('is', 'a', 'foo', 'bar', 'sentences', 'and'),
...               ('a', 'foo', 'bar', 'sentences', 'and', 'i'),
...               ('foo', 'bar', 'sentences', 'and', 'i', 'want'),
...               ('bar', 'sentences', 'and', 'i', 'want', 'to'),
...               ('sentences', 'and', 'i', 'want', 'to', 'ngramize'),
...               ('and', 'i', 'want', 'to', 'ngramize', 'it')]
>>> new_list = []
>>> for words in words_list:
...     new_list.append(' '.join(words)) # <---------------
... 
>>> new_list
['this is a foo bar sentences', 
 'is a foo bar sentences and', 
 'a foo bar sentences and i', 
 'foo bar sentences and i want', 
 'bar sentences and i want to', 
 'sentences and i want to ngramize', 
 'and i want to ngramize it']

The above for loop can be expressed as the following list comprehension:

new_list = [' '.join(words) for words in words_list] 
falsetru

You can do this efficiently like this:

joiner = " ".join
print(list(map(joiner, sixgrams)))

We can improve the performance a little further with a list comprehension:

joiner = " ".join
print([joiner(words) for words in sixgrams])

The performance comparison shows that the list comprehension solution above is slightly faster than the other two solutions.

from timeit import timeit

joiner = " ".join

# sixgrams is the list of tuples from the question

def mapSolution():
    # list() forces the map iterator so all three functions build the full list
    return list(map(joiner, sixgrams))

def comprehensionSolution1():
    # looks up " ".join (a new bound method) on every iteration
    return [" ".join(words) for words in sixgrams]

def comprehensionSolution2():
    # reuses the joiner bound method created once above
    return [joiner(words) for words in sixgrams]

print(timeit("mapSolution()", "from __main__ import joiner, mapSolution, sixgrams"))
print(timeit("comprehensionSolution1()", "from __main__ import sixgrams, comprehensionSolution1, joiner"))
print(timeit("comprehensionSolution2()", "from __main__ import sixgrams, comprehensionSolution2, joiner"))

Output on my machine

1.5691678524
1.66710209846
1.47555398941

The performance gain most likely comes from the fact that we don't have to create the bound join method from the string every time; joiner is created once and reused.

Edit: Though we can improve the performance like this, the most Pythonic way is to use generators, as in lvc's answer.

thefourtheye