I have sets of strings from which I need to construct the main indicator variable columns in a DataFrame. Is there a way to do this dimension expansion in Python pandas?

E.g. if I have these two sets:

los = set(["abc", "def"])
his = set(["X", "Y", "Z"])

As a result, I want a DataFrame that contains every combination of the sets' elements, like this:

import pandas as pd
df = pd.DataFrame({"los": ["abc", "abc", "abc", "def", "def", "def"], "his": ["X", "Y", "Z", "X", "Y", "Z"]})

Ideally, I'd like this to be easily generalizable to an arbitrary number of sets.

Antti
  • What do you expect the output dataframe to look like? Do you want 'los' & 'his' as column headings? In the case you have provided, since the number of combinations of los exceed the number of combinations of his, the length of columns will be different. Do you want the excess rows to be filled with NaN or empty string? – itprorh66 Oct 20 '20 at 16:45
  • I am not sure I understand your point? The resulting DataFrame is as I have stated. – Antti Oct 20 '20 at 18:24

4 Answers

2

You can use itertools.product from Python's standard library to do this very easily:

import itertools
import pandas as pd

los = set(["abc", "def"])
his = set(["X", "Y", "Z"])

data = itertools.product(los, his)
df = pd.DataFrame(data, columns=["los", "his"])

print(df)
   los his
0  def   Z
1  def   X
2  def   Y
3  abc   Z
4  abc   X
5  abc   Y
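
Since the question also asks about an arbitrary number of sets, the same idea can be sketched by unpacking a collection of sets into itertools.product. This is only an illustration: the `sets` dict and its third column `mid` are made up for the example, and each set is sorted first because sets are unordered, so the row order would otherwise vary between runs.

```python
import itertools
import pandas as pd

# Hypothetical mapping from column name to its set of values;
# "mid" is an invented third set, not part of the question.
sets = {
    "los": {"abc", "def"},
    "his": {"X", "Y", "Z"},
    "mid": {1, 2},
}

# Sort each set so the row order is reproducible (sets have no fixed order).
levels = [sorted(values) for values in sets.values()]

# product(*levels) yields one tuple per combination, in column order.
rows = list(itertools.product(*levels))
df = pd.DataFrame(rows, columns=list(sets))

print(df.shape)
```

The number of rows is the product of the set sizes (here 2 * 3 * 2 = 12), so this can grow quickly as more sets are added.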
Cameron Riddell
  • Thanks! This is what I was looking for. I liked the MultiIndex approach by @political_scientist too, but your solution was three times faster. – Antti Oct 20 '20 at 18:24
2

You can use itertools.product for this:

In [1308]: import itertools
In [1312]: x, y = [], []

In [1314]: for i,j in itertools.product(los,his):
      ...:     x.append(i)
      ...:     y.append(j)
      ...: 

In [1315]: x
Out[1315]: ['abc', 'abc', 'abc', 'def', 'def', 'def']

In [1316]: y
Out[1316]: ['Z', 'X', 'Y', 'Z', 'X', 'Y']

Then you can create your df like this:

In [1318]: df = pd.DataFrame({'los': x, 'his': y})

In [1319]: df
Out[1319]: 
   los his
0  abc   Z
1  abc   X
2  abc   Y
3  def   Z
4  def   X
5  def   Y
Mayank Porwal
1

A nested for loop should generate your data:

los = set(["abc", "def"])
his = set(["X", "Y", "Z"])

a = []
b = []
for i in los:
    for j in his:
        a.append(i)
        b.append(j)

results in

a = ['def', 'def', 'def', 'abc', 'abc', 'abc']
b = ['X', 'Y', 'Z', 'X', 'Y', 'Z']

If you want it in dictionary format:

d = {}
d['los'] = a
d['his'] = b

A more Pythonic way to do it would be a list comprehension. See this SO post for more details.
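
As a rough sketch of the list-comprehension version (the `pairs` name is just for illustration, and sorted() is added only to make the order reproducible):

```python
import pandas as pd

los = {"abc", "def"}
his = {"X", "Y", "Z"}

# One comprehension replaces the nested loop; the outer loop comes first.
# sorted() pins the iteration order, since sets are unordered.
pairs = [(lo, hi) for lo in sorted(los) for hi in sorted(his)]

# Each tuple becomes one row of the DataFrame.
df = pd.DataFrame(pairs, columns=["los", "his"])
print(df)
```

Building a list of row tuples avoids keeping two parallel lists in sync, but it is otherwise equivalent to the loop above.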

willwrighteng
1

Check out pandas.MultiIndex.from_product. This way you don't need to import itertools:

pd.MultiIndex.from_product([los, his], names=['los', 'his']).to_frame(index=False)
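
A slightly fuller sketch of the same approach, assuming the sets are kept in a dict keyed by column name (the `sets` dict is an illustrative choice, not something from the question). Each set is converted to a sorted list before being passed in, so the row order is deterministic.

```python
import pandas as pd

# Hypothetical container: column name -> set of values
sets = {
    "los": {"abc", "def"},
    "his": {"X", "Y", "Z"},
}

# from_product takes a list of iterables, so any number of sets works.
index = pd.MultiIndex.from_product(
    [sorted(values) for values in sets.values()],
    names=list(sets),
)

# index=False turns the index levels into ordinary columns
# with a default integer index.
df = index.to_frame(index=False)
print(df)
```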
help-ukraine-now