
There is a similar question but the output I am looking for is different.

I have a dataframe in which each column is a word, each row is a document, and each cell holds the number of times that word occurs in that document.

It looks like this:

{'orange': {0: '1',
            1: '3'},
 'blue': {0: '0',
          1: '2'}}

The output should "re-create" the original document as a bag of words in this way:

corpus = [
['orange'],
['orange', 'orange', 'orange', 'blue', 'blue']]

How can I do this?

Nick

1 Answer


If you want a dataframe at the end, you could do:

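import pandas as pd
from collections import defaultdict

data = {'orange': {0: '1',
                   1: '3'},
        'blue': {0: '0',
                 1: '2'}}

# Map each row (document) index to its list of repeated words.
results = defaultdict(list)
for color, placement in data.items():
    for row, count in placement.items():
        # Counts are stored as strings, so cast before repeating the word.
        results[row].extend(int(count) * [color])

# One row per document; shorter documents are padded with None.
df = pd.DataFrame.from_dict(results, orient='index')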

If you want a list of lists, just do:

[v for row, v in results.items()]

instead of building the DataFrame.
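
Since the question starts from a DataFrame rather than a dict, the same idea can probably be applied to it directly. This is only a minimal sketch: it assumes the counts are stored as strings, as in the question, and casts them with astype(int). The names counts and corpus are just illustrative.

```python
import pandas as pd

# Same data as in the question, loaded as a DataFrame:
# columns are words, rows are documents, cells are counts stored as strings.
df = pd.DataFrame({'orange': {0: '1', 1: '3'},
                   'blue': {0: '0', 1: '2'}})

# Cast the string counts to integers so they can be used as repeat counts.
counts = df.astype(int)

# For each document (row), repeat each word as many times as its count.
corpus = [
    [word for word, n in row.items() for _ in range(n)]
    for _, row in counts.iterrows()
]

print(corpus)
# [['orange'], ['orange', 'orange', 'orange', 'blue', 'blue']]
```

Words come out in column order, not in their original order within the document, because the count table does not keep that information.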

Steven G