I am working with a relatively large dataset in Python with pandas, and I am trying to build every combination of values from multiple columns as strings.

Let's say I have two lists, x and y, where x = ["sector_1", "sector_2", "sector_3", ...] and y = [7, 19, 21, ...].

I have been using a for loop to build combinations such that combined = ["sector_1--7", "sector_1--19", "sector_1--21", "sector_2--7", "sector_2--19", ...], with the separator here defined as --.

My current code looks like this:

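import numpy as np
import pandas as pd

sep = '--'
combined = np.empty(0, dtype='object')
for x_value in x:
    for y_value in y:
        # np.append copies the whole array on every call, so this loop gets slow quickly
        combined = np.append(combined, str(x_value) + sep + str(y_value))
combined = pd.DataFrame(combined)
# Split the joined strings back into two columns
combined = combined.iloc[:, 0].str.split(sep, expand=True)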

The code above works, but I am wondering whether there is a better way, ideally one with a faster runtime.

ehsan shirzadi
  • Use [`itertools.product()`](https://docs.python.org/3/library/itertools.html#itertools.product) – Barmar Dec 17 '21 at 18:02
  • It seems to me this question is more suited to be asked in the [Code Review Forum](https://codereview.stackexchange.com/). Code Review is a question and answer site for peer programmer code reviews. Please read the relevant guidance related to how to properly ask questions on this site before posting your question. – itprorh66 Dec 17 '21 at 18:03
  • whoops, sorry I wasn't aware there was a specific forum for peer programmer code reviews. Thanks for pointing that out! – matt_was_unavailable Dec 17 '21 at 18:10
  • `combined = ["--".join(map(str,s)) for s in itertools.product(x, y)]`? – not_speshal Dec 17 '21 at 18:11
  • Does this answer your question? [Permutations between two lists of unequal length](https://stackoverflow.com/questions/12935194/permutations-between-two-lists-of-unequal-length) – not_speshal Dec 17 '21 at 18:12
  • @MathewHalim please see the bottom of my answer. It's much, ***much*** more efficient than this solution you have here :) –  Dec 17 '21 at 18:15
  • I've voted to reopen this because it has an accepted answer. Moving it to CR is not needed. Improvement questions like this, especially if they involve more efficient methods, not just stylistic changes, are often answered on SO. Also CR is pickier about the completeness of the original question. – hpaulj Dec 17 '21 at 19:01
  • @itprorh66 - it's always a good idea to point the asker at [A guide to Code Review for Stack Overflow users](//codereview.meta.stackexchange.com/a/5778), as some things are done differently over there - e.g. we need a good description of the *purpose* of the code to give context, and question titles should simply say what the code *does* (the question is always, "_How can I improve this?_"). It's important that the code works correctly; include the unit tests if possible. – Toby Speight Dec 18 '21 at 11:25

1 Answer

Try this:

import itertools as it
combined = [f'{a}--{b}' for a, b in it.product(x, y)]

Output:

>>> combined
['sector_1--7',
 'sector_1--19',
 'sector_1--21',
 'sector_1--Ellipsis',
 'sector_2--7',
 'sector_2--19',
 'sector_2--21',
 'sector_2--Ellipsis',
 'sector_3--7',
 'sector_3--19',
 'sector_3--21',
 'sector_3--Ellipsis',
 'Ellipsis--7',
 'Ellipsis--19',
 'Ellipsis--21',
 'Ellipsis--Ellipsis']
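
As a self-contained sketch of the comprehension above, here it is wrapped in a small function with the separator as a parameter. The helper name `combine` and the short sample lists are assumptions for illustration only; the real x and y are longer.

```python
import itertools as it


def combine(x, y, sep="--"):
    """Join every (x, y) pair into one string, with x varying slowest."""
    return [f"{a}{sep}{b}" for a, b in it.product(x, y)]


# Sample values only; stand-ins for the real columns.
x = ["sector_1", "sector_2", "sector_3"]
y = [7, 19, 21]

combined = combine(x, y)
print(combined[:4])
```

Because `it.product` yields pairs lazily and the list is built in one pass, this avoids the repeated array copies that `np.append` makes inside a loop.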

Instead of all that, though, you can skip the joined strings entirely and build the two-column DataFrame directly with a combination of np.tile and np.repeat:

combined_df = pd.DataFrame({0: np.repeat(x, len(y)), 1: np.tile(y, len(x))})

Output:

>>> combined_df
           0         1
0   sector_1         7
1   sector_1        19
2   sector_1        21
3   sector_1  Ellipsis
4   sector_2         7
5   sector_2        19
6   sector_2        21
7   sector_2  Ellipsis
8   sector_3         7
9   sector_3        19
10  sector_3        21
11  sector_3  Ellipsis
12  Ellipsis         7
13  Ellipsis        19
14  Ellipsis        21
15  Ellipsis  Ellipsis
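
A minimal sketch of the np.repeat / np.tile construction with lists of different lengths may help show how the counts line up: each x value is repeated len(y) times, and the whole of y is tiled len(x) times, so both columns end up with len(x) * len(y) rows. The helper name `cross_frame` and the sample values are assumptions for illustration.

```python
import itertools as it

import numpy as np
import pandas as pd


def cross_frame(x, y):
    """Build a two-column DataFrame holding every (x, y) pair."""
    return pd.DataFrame({
        0: np.repeat(x, len(y)),  # each x value once per y value
        1: np.tile(y, len(x)),    # the full y sequence once per x value
    })


# Sample values only; the lengths differ on purpose.
x = ["sector_1", "sector_2", "sector_3"]
y = [7, 19]

combined_df = cross_frame(x, y)
pairs = list(zip(combined_df[0], combined_df[1]))
print(combined_df)
```

The row order matches what `itertools.product(x, y)` would produce, so the two approaches can be checked against each other.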