I have a big data frame with N columns. Columns are presented in pairs as follows:

  • column 1: ISIN 1, sequence of daily dates (issuance to maturity of bond 1)
  • column 2: historical data on prices wrt ISIN1
  • column 3: ISIN 2, sequence of daily dates (issuance to maturity of bond 2)
  • column 4: historical data on prices wrt ISIN2 and so on.

Columns are paired like this: the first two go together, then the next two, and so on until the end of the dataframe:

  XS0552790049  Unnamed: 5583 XS0628646480  Unnamed: 5585
0   2010-10-22          100.0   2011-05-24         99.711
1   2010-10-25          100.0   2011-05-25         99.685
2   2010-10-26          100.0   2011-05-26        100.125
3   2010-10-27          100.0   2011-05-27         99.893
4   2010-10-28          100.0   2011-05-30         99.792

I want to subset this big data frame into N/2 subsamples, each containing a pair of columns "ISIN dates + prices", as shown above. I thought about using a for loop, but I am definitely missing something as it does not generate the subsamples. Perhaps I am indexing wrong.

Here's my attempt: I tried to create a dictionary containing a subsample for every key.

sub = {}
for i in range(0,len(df.columns)+1):
    sub[i] = df.iloc[:,i:i+3]

I am pretty new to Python, so any suggestion is welcome.

Ian Thompson
  • Welcome to Stack Overflow! Please take the [tour](https://stackoverflow.com/tour). Input data is better shared as text, see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples), to help us help you. Dataframes are best shared as print(df) or df.to_dict() – OCa Sep 01 '23 at 13:41
  • Based on your example data, what do you expect the output to be? – Ian Thompson Sep 01 '23 at 14:36
  • Good question, actually. I've been screening for duplicates, but I'm not finding it asked and answered before. – OCa Sep 01 '23 at 15:40
  • You could omit the financial wording (ISIN, wrt...) as it does not bring value to the coding part, and only obscures your introduction. – OCa Sep 01 '23 at 15:43
  • Thank you @IanThompson for providing the input dataframe as text. I've edited my answer to use it instead of a dummy – OCa Sep 01 '23 at 15:58
  • @IanThompson did you use OCR? curious what your method was – OCa Sep 01 '23 at 16:08
  • @OCa -- I typed it out manually – Ian Thompson Sep 01 '23 at 19:33
  • @IanThompson all right, one comment upvote for support :D – OCa Sep 01 '23 at 19:35
  • Did you manage to make it work for you? – OCa Sep 02 '23 at 08:51

1 Answer

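Mostly, you just omitted the step in your range(start, stop, step) call: use step=2. Note also that the slice i:i+3 selects three columns rather than two, and that the stop value should be len(df.columns), without the +1.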

A list comprehension then expresses the loop more compactly:

dfs = [ df.iloc[:,[i,i+1]] for i in range(0, len(df.columns), 2) ]

This will return your requested list of pairwise subsets:

dfs
[  XS0552790049  Unnamed: 5583
 0   2010-10-22          100.0
 1   2010-10-25          100.0
 2   2010-10-26          100.0
 3   2010-10-27          100.0
 4   2010-10-28          100.0,
   XS0628646480  Unnamed: 5585
 0   2011-05-24         99.711
 1   2011-05-25         99.685
 2   2011-05-26        100.125
 3   2011-05-27         99.893
 4   2011-05-30         99.792]
dfs[0]
  XS0552790049  Unnamed: 5583
0   2010-10-22          100.0
1   2010-10-25          100.0
2   2010-10-26          100.0
3   2010-10-27          100.0
4   2010-10-28          100.0
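
For reference, here is a self-contained sketch that rebuilds the sample frame from the question and applies the same pairing. It uses the slice i:i+2 instead of the list [i, i+1]; this is a variation on the line above, chosen because a slice does not raise if the frame happens to have an odd number of columns (the last subset would then simply hold a single column).

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "XS0552790049": ["2010-10-22", "2010-10-25", "2010-10-26", "2010-10-27", "2010-10-28"],
    "Unnamed: 5583": [100.0, 100.0, 100.0, 100.0, 100.0],
    "XS0628646480": ["2011-05-24", "2011-05-25", "2011-05-26", "2011-05-27", "2011-05-30"],
    "Unnamed: 5585": [99.711, 99.685, 100.125, 99.893, 99.792],
})

# Walk the column positions two at a time; i:i+2 selects columns i and i+1.
dfs = [df.iloc[:, i:i + 2] for i in range(0, df.shape[1], 2)]

print(len(dfs))               # number of pairwise subsets
print(list(dfs[1].columns))   # column labels of the second pair
```

With the four sample columns this yields two subsets, each with five rows and two columns.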

Side notes:

  • Consider a name other than sub for the variable: sub is also a function in the re module (re.sub), so the name can be confusing, and it would shadow that function after from re import sub.
  • {} instantiates a dictionary, while you seem to require a list.
  • df.shape[1] may replace len(df.columns), since df.shape gives the dataframe dimensions as a (rows, columns) tuple.
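
If a dictionary is preferred after all, as in the question's attempt, one possible sketch is to key each subset by the label of its first column (the ISIN). The names pairs, "date" and "price" are hypothetical choices for illustration, and the dropna call assumes that shorter series are padded with NaN in the wide frame, which the sample does not show.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "XS0552790049": ["2010-10-22", "2010-10-25", "2010-10-26", "2010-10-27", "2010-10-28"],
    "Unnamed: 5583": [100.0, 100.0, 100.0, 100.0, 100.0],
    "XS0628646480": ["2011-05-24", "2011-05-25", "2011-05-26", "2011-05-27", "2011-05-30"],
    "Unnamed: 5585": [99.711, 99.685, 100.125, 99.893, 99.792],
})

pairs = {}
for i in range(0, df.shape[1], 2):
    isin = df.columns[i]                  # label of the first column in the pair
    pair = df.iloc[:, i:i + 2].copy()     # copy so the original frame is untouched
    pair.columns = ["date", "price"]      # hypothetical, uniform column names
    pair["date"] = pd.to_datetime(pair["date"])
    # Assumption: rows beyond a bond's maturity are all-NaN padding; drop them.
    pairs[isin] = pair.dropna(how="all")

print(list(pairs))
```

Each subset can then be looked up by ISIN, for example pairs["XS0628646480"].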
OCa