1

I have a dataframe that the first column are strings, the 2nd column are the numbers that I want to replicate each string.

df = pd.DataFrame({'String':['a','b', 'c', 'd', 'e', 'f', 'g'],
        'Times':[2, 3, 5, 3, 2, 4, 5]})
df
String Times
a 2
b 3
c 5
d 3
e 2
f 4
g 5

How can I create a data frame into this in python? (stopped at d but you know the pattern). Thanks!

String Times
a 1
a 2
b 1
b 2
b 3
c 1
c 2
c 3
c 4
c 5
wjandrea
  • 28,235
  • 9
  • 60
  • 81
Shang Lu
  • 91
  • 6

5 Answers5

3

A bit messy, but seems to do what you want:

new_df = pd.DataFrame(df.String.str.repeat(df.Times).apply(list).explode().reset_index(drop=True))

new_df["Times"] = new_df.groupby("String").cumcount() + 1

Output:

   String  Times
0       a      1
1       a      2
2       b      1
3       b      2
4       b      3
5       c      1
6       c      2
7       c      3
8       c      4
9       c      5
10      d      1
11      d      2
12      d      3
13      e      1
14      e      2
15      f      1
16      f      2
17      f      3
18      f      4
19      g      1
20      g      2
21      g      3
22      g      4
23      g      5
ddejohn
  • 8,775
  • 3
  • 17
  • 30
3

Another method could be:

df.assign(Times = df.Times.apply(lambda x: range(1, x + 1))).explode('Times')
Out[]: 
  String Times
0      a     1
0      a     2
1      b     1
1      b     2
1      b     3
2      c     1
2      c     2
2      c     3
2      c     4
2      c     5
Onyambu
  • 67,392
  • 3
  • 24
  • 53
2

Use Index.repeat to replicate the rows and groupby+cumcount: to enumerate them:

(df.loc[df.index.repeat(df['Times'])]
   .assign(Times=lambda d: d.groupby('String').cumcount().add(1))
 )

Output:

  String  Times
0      a      1
0      a      2
1      b      1
1      b      2
1      b      3
2      c      1
2      c      2
2      c      3
2      c      4
2      c      5
3      d      1
...
comparison of all answers (7k rows):
@mozway
11.2 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Onyambu
16 ms ± 685 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@ifly6
27.3 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@ddejohn
28.5 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

On 700k rows: 1s, 1.43s, 2.34s, 2.94s in the same order

mozway
  • 194,879
  • 13
  • 39
  • 75
1

Use df.explode. Make this work by converting your repetitions to list:

s = ['a','b', 'c', 'd', 'e', 'f', 'g']
t = [2, 3, 5, 3, 2, 4, 5]

Then,

times = [list(range(i + 1)) for i in t]
df = pd.DataFrame({'string': s, 'times': times}).explode('times')
>>> df.head(10)
  string times
0      a     0
0      a     1
0      a     2
1      b     0
1      b     1
1      b     2
1      b     3
2      c     0
2      c     1
2      c     2

If you don't want the index repeated like that, df.reset_index(drop=True, inplace=True) (if in place, otherwise, reassign). If you want 1-indexing rather than 0-indexing, for list(range(i + 1)) substitute list(range(1, i + 2)). Alternatively, just add one to the times column at the end. The times column doesn't really matter for df.explode, it explodes based on repetitions anyway.

ifly6
  • 5,003
  • 2
  • 24
  • 47
0
ss1=df.String.mul(df.Times).map(list).explode("")
ss2=ss1.groupby(ss1).cumcount().add(1)
pd.DataFrame(dict(String=ss1,Times=ss2))

Output:

   String  Times
0       a      1
1       a      2
2       b      1
3       b      2
4       b      3
5       c      1
6       c      2
7       c      3
8       c      4
9       c      5
10      d      1
11      d      2
12      d      3
13      e      1
14      e      2
15      f      1
16      f      2
17      f      3
18      f      4
19      g      1
20      g      2
21      g      3
22      g      4
23      g      5
G.G
  • 639
  • 1
  • 5