I am working on a data set. Its first rows look like this:

Column1 Column2

1 [food=3, party=2,....]
2 [ocean=2, fish=3, surf=2,....]
. ..................
. ..................
. ..................

(Column1 has serial numbers and Column2 has the list of words with frequencies.)

Each row has multiple words with their respective frequencies.

I would like to convert Column2 as follows:

[food, food, food, party, party.....] and so on.

I am finding it difficult and don't know where to start. I tried tokenizing, but don't know how to proceed.

3 Answers

Assuming your data is in a list:

import pandas as pd

l = ['food=3', 'party=2']

s = pd.Series(l).str.split('=', expand=True)  # split each item on `=` into a word column and a count column

s.iloc[:, 0].repeat(s.iloc[:, 1].astype(int)).tolist()  # repeat each word by its count
Out[549]: ['food', 'food', 'food', 'party', 'party']
BENY

Here is one way.

from itertools import chain

data = [['food=3', 'party=2'],
        ['drink=5', 'sleep=1']]

def repeater(lst):
    return list(chain(*([j[0]]*int(j[1]) for j in (i.split('=') for i in lst))))

list(map(repeater, data))

# [['food', 'food', 'food', 'party', 'party'],
#  ['drink', 'drink', 'drink', 'drink', 'drink', 'sleep']]
jpp

Assuming you are starting with a list of lists of strings, you could do this:

dataset = [
    ['food=3', 'party=2'],
    ['word=2', 'apple=3'],
]

def multiply_word(item):
    word, freq = item.split('=')
    return [word] * int(freq)

result = [
    sum((multiply_word(item) for item in row), [])
    for row in dataset
]

result
# [
#     ['food', 'food', 'food', 'party', 'party'], 
#     ['word', 'word', 'apple', 'apple', 'apple']
# ]

Or you could use this "one-liner" (inspired by @jp_data_analysis's answer):

[
    sum(
        ([word] * int(freq) for word, freq in (item.split('=') for item in row)),
        []
    )
    for row in dataset
]

If you have a lot of words in each row, then you should probably use itertools.chain instead of sum, because sum builds a new list at every addition. See "why sum on lists is (sometimes) faster than itertools.chain?".
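As a rough sketch of that suggestion (not benchmarked here, and `expand_row` is just an illustrative name, not something from the answers above), the same result can be built with itertools.chain.from_iterable, which concatenates the per-word lists in one pass:

```python
from itertools import chain

# Same sample data as in the examples above
dataset = [
    ['food=3', 'party=2'],
    ['word=2', 'apple=3'],
]

def expand_row(row):
    """Expand each 'word=count' string in a row into that many copies of the word."""
    # Split every item into a (word, count) pair, lazily
    pairs = (item.split('=') for item in row)
    # Build one short list per word, then flatten them all in a single pass
    return list(chain.from_iterable([word] * int(freq) for word, freq in pairs))

result = [expand_row(row) for row in dataset]
print(result)
# [['food', 'food', 'food', 'party', 'party'], ['word', 'word', 'apple', 'apple', 'apple']]
```

This assumes every item contains exactly one `=` and a valid integer count; anything else will raise a ValueError.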

Matthias Fripp