1

I have a task that is completely driving me mad. Lets suppose we have this df:

import pandas as pd

k = {'random_col':{0:'a',1:'b',2:'c'},'isin':{0:'ES0140074008', 1:'ES0140074008ES0140074010', 2:'ES0140074008ES0140074016ES0140074024'},'n_isins':{0:1,1:2,2:3}}

k = pd.DataFrame(k)

What I want to do is to double or triple a row a number of times goberned by col n_isins which is a number obtained by dividing the lentgh of col isin didived by 12, as isins are always strings of 12 characters.

So, I need 1 time row 0, 2 times row 1 and 3 times row 2. My real numbers are up-limited by 6 so it is a hard task. I began by using booleans and slicing the col isin but that does not take me to nothing. Hopefully my explanation is good enough. Also I need the col isin sliced like this [0:11] + ' ' + [12:23]... splitting by the 'E' but I think I know how to do that, I just post it cause is the criteria that rules the number of times I have to copy each row. Thanks in advance!

Borja_042
  • 1,071
  • 1
  • 14
  • 26

1 Answers1

1

I think you need numpy.repeat with loc, last remove duplicates in index by reset_index. Last for new column use custom splitting function with numpy.concatenate:

n = np.repeat(k.index, k['n_isins'])
k = k.loc[n].reset_index(drop=True)
print (k)
                                   isin  n_isins random_col
0                          ES0140074008        1          a
1              ES0140074008ES0140074010        2          b
2              ES0140074008ES0140074010        2          b
3  ES0140074008ES0140074016ES0140074024        3          c
4  ES0140074008ES0140074016ES0140074024        3          c
5  ES0140074008ES0140074016ES0140074024        3          c

#https://stackoverflow.com/a/7111143/2901002
def chunks(s, n):
    """Produce `n`-character chunks from `s`."""
    for start in range(0, len(s), n):
        yield s[start:start+n]

s = np.concatenate(k['isin'].apply(lambda x: list(chunks(x, 12))))
df['new'] = pd.Series(s, index = df.index)
print (df)
                                   isin  n_isins random_col           new
0                          ES0140074008        1          a  ES0140074008
1              ES0140074008ES0140074010        2          b  ES0140074008
2              ES0140074008ES0140074010        2          b  ES0140074010
3  ES0140074008ES0140074016ES0140074024        3          c  ES0140074008
4  ES0140074008ES0140074016ES0140074024        3          c  ES0140074016
5  ES0140074008ES0140074016ES0140074024        3          c  ES0140074024
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Using your code in k df test I am using I get this error and I do not really understand why: ValueError: Wrong number of items passed 14, placement implies 6 – Borja_042 May 23 '17 at 10:51
  • There is problem your data are wrongly splitted, what return `s` ? – jezrael May 23 '17 at 10:53
  • You can simulate it by `s = np.concatenate(k['isin'].apply(lambda x: list(chunks(x, 10))))` with sample data - splited by 10, not 12. – jezrael May 23 '17 at 10:53
  • With that sentence i get this same error: ValueError: Wrong number of items passed 20, placement implies 6 and I need to cut the string for each 12 characters – Borja_042 May 23 '17 at 10:55
  • Ok, error raise in `df['new'] = pd.Series(s, index = df.index)`, right? – jezrael May 23 '17 at 10:57
  • Yes: ValueError: Wrong number of items passed 20, placement implies 6. I am on python3, but I think you are with that version too – Borja_042 May 23 '17 at 11:00
  • Ok, and it means `s = np.concatenate(k['isin'].apply(lambda x: list(chunks(x, 10))))` return less or more values as length of rows. Rows is `20`, but length of `s` is `6`. So i think you split values wrongly. Test it by `print (s)` – jezrael May 23 '17 at 11:02
  • This is what I get: s Out[19]: array(['ES0140074008', 'ES0140074008', 'ES0140074010', ..., 'ES0140074008', 'ES0140074016', 'ES0140074024'], dtype=' – Borja_042 May 23 '17 at 11:14
  • Yes, but if check `print (len(s))` and `print (len(df))` it is difference, right? And so get error. – jezrael May 23 '17 at 11:20
  • print (len(s)) > 14 print (len(df))> 6 – Borja_042 May 23 '17 at 11:24
  • So problem is in data. because if split by `12` chars as in sample `s = np.concatenate(k['isin'].apply(lambda x: list(chunks(x, 12))))` it works perfectly. Do you split by another number as `12`? – jezrael May 23 '17 at 12:12
  • Or I see another problem. In your sample `n_isins` column return number of splitted `isin`. If it is changed, get your error also. But if still problem, can you change your data sample `k = {'random_col':{0:'a',1:'b',2:'c'},'isin':{0:'ES0140074008', 1:'ES0140074008ES0140074010', 2:'ES0140074008ES0140074016ES0140074024'},'n_isins':{0:1,1:2,2:3}}` ? – jezrael May 23 '17 at 12:16
  • I have idea - If data are not confidental, can you send it to my email with desired output? If confidental, can you anonymize it before? – jezrael May 23 '17 at 16:03
  • I think I already have it mate! I finally made it with several loops, but I thought maybe pandas would allow me to do it in a more direct way. Many thanks for your help!. I was parsing this table: http://www.bde.es/webbde/es/estadis/fi/ifs_es.html – Borja_042 May 24 '17 at 07:41