How do I merge the bigrams below to a single string?

_bigrams=['the school', 'school boy', 'boy is', 'is reading']
_split=(' '.join(_bigrams)).split()
_newstr=[]
_filter=[_newstr.append(x) for x in _split if x not in _newstr]
_newstr=' '.join(_newstr)
print(_newstr)

Output: 'the school boy is reading'. That is the desired output, but the approach is too long and not very efficient given the large size of my data. Secondly, it does not support duplicate words in the final string, e.g. 'the school boy is reading, is he?': only one 'is' would be kept in the final string in this case.

Any suggestions on how to make this work better? Thanks.

Tiger1

3 Answers

# Multi-for generator expression allows us to create a flat iterable of words
all_words = (word for bigram in _bigrams for word in bigram.split())

def no_runs_of_words(words):
    """Takes an iterable of words and returns one with any runs condensed."""
    prev_word = None
    for word in words:
        if word != prev_word:
            yield word
        prev_word = word

final_string = ' '.join(no_runs_of_words(all_words))

This takes advantage of generators to evaluate lazily, so the full set of words is not held in memory until the one final string is built.
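As a quick check, here is a minimal sketch of the same approach wrapped in a helper and run on the question's sample. The `merge_bigrams` name and the `bigrams_with_repeat` list are assumptions added for illustration; the second list mimics the 'is reading, is he?' case from the question, where the same word appears twice but not adjacently.

```python
def no_runs_of_words(words):
    """Yield each word, skipping any word equal to the one just before it."""
    prev_word = None
    for word in words:
        if word != prev_word:
            yield word
        prev_word = word


def merge_bigrams(bigrams):
    """Hypothetical helper: flatten the bigrams lazily, then collapse runs."""
    all_words = (word for bigram in bigrams for word in bigram.split())
    return ' '.join(no_runs_of_words(all_words))


bigrams = ['the school', 'school boy', 'boy is', 'is reading']
print(merge_bigrams(bigrams))  # the school boy is reading

# Made-up input: 'is' occurs twice in the sentence, separated by another word
bigrams_with_repeat = ['boy is', 'is reading,', 'reading, is', 'is he?']
print(merge_bigrams(bigrams_with_repeat))  # boy is reading, is he?
```

Only adjacent duplicates are collapsed, so a word that recurs later in the sentence survives.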

Amber
  • Thanks for the solution. It actually addressed the problem, but I will wait a bit to see if I can get a two- or three-line solution before marking it as answered. – Tiger1 Mar 15 '14 at 16:16
  • That's up to you - though I would note that number of lines is not always the best metric upon which to judge code quality. – Amber Mar 15 '14 at 16:21
  • you are absolutely right, but I've got a massive algorithm which I'm trying to compress. – Tiger1 Mar 15 '14 at 16:23
  • Breaking something into clear functions is often a better way to make algorithms understandable than simply reducing the number of lines. For instance, the `no_runs_of_words()` function is easier to read when looking at how the final string is generated. The person reading the algorithm doesn't have to care about how that function is implemented, because they can clearly tell what it does from the name. – Amber Mar 15 '14 at 16:24
  • This won't work if there are repeated contiguous words leading to a `_bigrams` entry of (e.g.) `"that that"`. – DSM Mar 15 '14 at 18:13
  • @DSM Yes. However that's not usually an issue. – Amber Mar 15 '14 at 21:52

If you really wanted a one-liner, something like this could work:

' '.join(val.split()[0] for val in _bigrams) + ' ' + _bigrams[-1].split()[-1]
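A hedged sketch of the same idea as a function, in case it helps to see its behaviour on edge cases. It assumes every entry is a non-empty whitespace-separated string and that consecutive bigrams overlap by one word; the `merge_by_first_words` name, the empty-list guard, and the `sample` list are additions for illustration, not part of the one-liner above.

```python
def merge_by_first_words(bigrams):
    """Join the first word of every bigram, then append the last word of the final one."""
    if not bigrams:
        # The bare one-liner would raise IndexError on an empty list
        return ''
    first_words = [bigram.split()[0] for bigram in bigrams]
    return ' '.join(first_words + [bigrams[-1].split()[-1]])


print(merge_by_first_words(['the school', 'school boy', 'boy is', 'is reading']))
# the school boy is reading

# Made-up input with a genuinely repeated contiguous word ('that that')
sample = ['I know', 'know that', 'that that', 'that is', 'is true']
print(merge_by_first_words(sample))
# I know that that is true
```

Because it never compares neighbouring words, this variant keeps a legitimately doubled word intact.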
M4rtini

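Would this do it? It takes the first word of every entry except the last, which is kept whole:

_bigrams=['the school', 'school boy', 'boy is', 'is reading']

# Slice instead of comparing against _bigrams[-1], so an earlier entry equal to the last one is still reduced to its first word
clause = [a.split()[0] for a in _bigrams[:-1]] + _bigrams[-1:]

print(' '.join(clause))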

Output

the school boy is reading

However, as far as performance is concerned, Amber's solution is probably a good option.

embert