I have a list of lists of sentences and I want to pad all sentences so that they are of the same length.

I was able to do this, but I am trying to find the most efficient way to do it and to challenge myself.

max_length = max(len(sent) for sent in sents)
list_length = len(sents)
sents_padded = [[pad_token for i in range(max_length)] for j in range(list_length)]
for i,sent in enumerate(sents):
    sents_padded[i][0:len(sent)] = sent 

and I used the inputs:

sents = [["Hello","World"],["Where","are","you"],["I","am","doing","fine"]]
pad_token = "Hi"

Is my method efficient, or are there better ways to do it?

mkrieger1
vkaul11
  • I'd suggest doing this as part of the output, not in the data itself. But what's your desired output, here, same number of word/tokens in each container, or string length, or what? – Kenny Ostrom Jul 24 '20 at 20:04
  • Yes, the same size for all lists inside the list of lists. Whichever list is shorter should be filled with the pad token. – vkaul11 Jul 24 '20 at 20:12

4 Answers

itertools (in Python 3) provides this through zip_longest, which fills in the shorter sequences while transposing them. You can transpose the result back with zip(*...), and pass it to list if you prefer a list over an iterator. Note that each padded row comes back as a tuple.

import itertools
from pprint import pprint

sents = [["Hello","World"],["Where","are","you"],["I","am","doing","fine"]]
pad_token = "Hi"

padded = zip(*itertools.zip_longest(*sents, fillvalue=pad_token))
pprint(list(padded))

[('Hello', 'World', 'Hi', 'Hi'),
 ('Where', 'are', 'you', 'Hi'),
 ('I', 'am', 'doing', 'fine')]
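
Because zip yields tuples, the rows produced by this approach are tuples rather than lists. If lists are needed, as in the question, a minimal sketch could look like the following. The function name pad_with_zip_longest is hypothetical and not part of the original answer.

```python
from itertools import zip_longest


def pad_with_zip_longest(sents, pad_token):
    """Return a new list of lists, each padded to the longest sentence."""
    # zip_longest(*sents) transposes rows into columns and fills the gaps.
    columns = zip_longest(*sents, fillvalue=pad_token)
    # zip(*columns) transposes back; list() turns each tuple row into a list.
    return [list(row) for row in zip(*columns)]


sents = [["Hello", "World"], ["Where", "are", "you"], ["I", "am", "doing", "fine"]]
print(pad_with_zip_longest(sents, "Hi"))
# [['Hello', 'World', 'Hi', 'Hi'], ['Where', 'are', 'you', 'Hi'], ['I', 'am', 'doing', 'fine']]
```

One caveat of the double transpose: if every sentence is empty, there are no columns to transpose back, so the result is an empty list rather than a list of empty lists.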

Kenny Ostrom
  • This is in the requested output format, but as I said in my original comment on the question, you may be better off providing an iterator with zip_longest to whatever will end up using this. An iterator gives what you need when you need it, instead of fully reconstructing the data set before you even start using it. – Kenny Ostrom Jul 25 '20 at 17:53
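
One way to read the iterator suggestion is a generator that pads one sentence at a time instead of building the whole padded data set up front. This is only a sketch under that interpretation: the name iter_padded is hypothetical, and it assumes sents is a list (or another sequence that can be traversed twice), since the maximum length must be known before the first row is yielded.

```python
def iter_padded(sents, pad_token):
    """Yield each sentence padded to the length of the longest one."""
    # First pass: find the target length (0 if there are no sentences).
    max_length = max((len(sent) for sent in sents), default=0)
    for sent in sents:
        # Build one padded row at a time; the input lists are not modified.
        yield sent + [pad_token] * (max_length - len(sent))


sents = [["Hello", "World"], ["Where", "are", "you"], ["I", "am", "doing", "fine"]]
for row in iter_padded(sents, "Hi"):
    print(row)
```

Each padded row exists only while the consumer is working on it, which matters mostly when the data set is large.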

Here is how you can use str.ljust() to pad each string, using max() with key=len to find the width to pad each string to:

lst = ['Hello World', 'Good day!', 'How are you?']

width = len(max(lst, key=len))  # The length of the longest sentence
lst = [s.ljust(width) for s in lst]  # Right-pad each sentence with spaces to that width

print(lst)

Output:

['Hello World ', 'Good day!   ', 'How are you?']
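
str.ljust() also accepts an optional single-character fill argument, which makes the padding visible. The sketch below uses '*' purely as an illustrative choice. Note that this pads characters within each string, which is different from padding a list of word tokens as in the question.

```python
lst = ['Hello World', 'Good day!', 'How are you?']

# Width of the longest string in the list.
width = max(len(s) for s in lst)

# Right-pad every string to that width with '*' instead of spaces.
padded = [s.ljust(width, '*') for s in lst]

print(padded)
# ['Hello World*', 'Good day!***', 'How are you?']
```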
Red

Assumption:

The output should be the same as OP output (i.e. same number of words in each sublist).

Inputs:

sents = [["Hello","World"],["Where","are","you"],["I","am","doing","fine"]]
pad_token = "Hi"

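The following one-liner (after computing max_length as in the question) produces the same output as the OP's code.

max_length = max(len(sent) for sent in sents)  # as in the question
sents_padded = [sent + [pad_token]*(max_length - len(sent)) for sent in sents]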

print(sents_padded)
# [['Hello', 'World', 'Hi', 'Hi'], ['Where', 'are', 'you', 'Hi'], ['I', 'am', 'doing', 'fine']]
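
The equivalence claim can be checked directly. The sketch below runs the question's pre-fill-and-overwrite approach next to the list comprehension and compares the results. The variable name expected is introduced here only for illustration.

```python
sents = [["Hello", "World"], ["Where", "are", "you"], ["I", "am", "doing", "fine"]]
pad_token = "Hi"
max_length = max(len(sent) for sent in sents)

# The question's approach: pre-fill with pad tokens, then overwrite a prefix.
expected = [[pad_token] * max_length for _ in sents]
for i, sent in enumerate(sents):
    expected[i][:len(sent)] = sent

# The one-liner: concatenate the missing pad tokens onto each sentence.
sents_padded = [sent + [pad_token] * (max_length - len(sent)) for sent in sents]

print(sents_padded == expected)  # True
```

Both versions build new lists, so the original sents is left unchanged.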
DarrylG
  • Very ingenious indeed! – vkaul11 Jul 24 '20 at 20:21
  • @vkaul11--actually, thought your way was also clever, which makes me confused by this advice from the Zen of Python: "There should be one-- and preferably only one --obvious way to do it. Although that way may not be obvious at first unless you're Dutch." – DarrylG Jul 24 '20 at 20:25
  • Yes, I am currently trying to find different ways of doing things, and I feel Stack Overflow is a great place to see multiple ways of thinking. I would also like to find the one obvious way, but itertools and ljust were other ways people did it. – vkaul11 Jul 24 '20 at 21:50

This seemed to be faster when I timed it:

# Find the length of the longest sentence.
maxi = 0
for sent in sents:
    if len(sent) > maxi:
        maxi = len(sent)
# Append pad tokens in place until every sentence reaches that length.
for sent in sents:
    while len(sent) < maxi:
        sent.append(pad_token)
print(sents)
AlpacaJones
  • Gist timing this and all the other answers: [link](https://gist.github.com/KieranBrannigan/980d9e6809a552153cf8cfbfc2b441a4) – AlpacaJones Jul 24 '20 at 20:45
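
A timing comparison along these lines can be sketched with the standard timeit module. This is not the linked gist's code. The function names and the run count are assumptions made for illustration. Because the in-place version modifies its input, each run gets a fresh deep copy, and the cost of that copy is included in both measurements.

```python
import copy
import timeit

sents = [["Hello", "World"], ["Where", "are", "you"], ["I", "am", "doing", "fine"]]
pad_token = "Hi"


def pad_comprehension(sents, pad_token):
    """Build new padded lists; the input is left untouched."""
    max_length = max(len(sent) for sent in sents)
    return [sent + [pad_token] * (max_length - len(sent)) for sent in sents]


def pad_in_place(sents, pad_token):
    """Extend each sentence in place and return the same outer list."""
    max_length = max(len(sent) for sent in sents)
    for sent in sents:
        sent.extend([pad_token] * (max_length - len(sent)))
    return sents


def time_it(func, number=2000):
    # Copy the input on every call so the in-place version always starts
    # from unpadded data; the copy cost is the same for both functions.
    return timeit.timeit(lambda: func(copy.deepcopy(sents), pad_token), number=number)


for func in (pad_comprehension, pad_in_place):
    print(f"{func.__name__}: {time_it(func):.4f} s")
```

On inputs this small the deep copy likely dominates the measurement, so differences between the methods only become meaningful on much larger data.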