Extracting indices from Multi-level DataFrame as strings in a list

Question

I have a Multi-level DataFrame with 36 entries: It's a pandas DF that has 36 levels (36 stocks). Attached to a single-date time. For anyone curious, this is output that is created by using Zipline API, or Quantopians Pipeline API. So I don't have much control on how the dataframe is created.

As you see in the DF, each level is represented by an equity, or stock, for example: Equity(1251 [CAJ])

I'm trying to figure out how to extract the ticker symbol for each level in a string format and add them to a list! Such as list = ['CAJ', 'CBT', 'GILD', ...] followed by all other equities the program spit out.

This has caused me a huge headache and I've tried proceeding with:
result = df asset_list = result.index.levels[1] stocks = asset_list.get_level_values(0).unique()

But this does not seem to work and I get something way off, it might be an API issue as len(stocks) provides a value of 8830. I don't even know how that is possible.

Would greatly appreciate some coding help here. Please let me know if you need more information from my side.

As mentioned from the thread below thanks to @sammywemmy, here is a dictionary of the first 5 rows of the table above.

result.head().to_dict()

reproduces:

{'current ratio': {(Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1251 [CAJ])): 1.921883,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1315 [CBT])): 2.0836239999999999,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3212 [GILD])): 3.1044160000000001,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3460 [HAS])): 5.3676060000000003,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3798 [IDA])): 1.5076229999999999},
 'dividend yield': {(Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1251 [CAJ])): 7.4899999999999993,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1315 [CBT])): 5.3900000000000006,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3212 [GILD])): 3.29,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3460 [HAS])): 4.0599999999999996,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3798 [IDA])): 3.04},
 'interest coverage': {(Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1251 [CAJ])): 227.99559500000001,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1315 [CBT])): 4.5714290000000002,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3212 [GILD])): 8.8230450000000005,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3460 [HAS])): 9.5895290000000006,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3798 [IDA])): 3.1692089999999999},
 'marketcap': {(Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1251 [CAJ])): 21652842873.0,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1315 [CBT])): 1471969134.0,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3212 [GILD])): 98467576445.0,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3460 [HAS])): 9171245233.0,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3798 [IDA])): 4308534238.0},
 'payout ratio': {(Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1251 [CAJ])): 138.62,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1315 [CBT])): 63.009999999999998,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3212 [GILD])): 59.719999999999999,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3460 [HAS])): 65.930000000000007,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3798 [IDA])): 55.530000000000001},
 'pe_ratio': {(Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1251 [CAJ])): 18.307486999999998,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1315 [CBT])): 11.858447,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3212 [GILD])): 18.533175,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3460 [HAS])): 16.528395,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3798 [IDA])): 18.540130000000001},
 'price': {(Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1251 [CAJ])): 19.949999999999999,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1315 [CBT])): 25.969999999999999,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3212 [GILD])): 78.210000000000008,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3460 [HAS])): 67.060000000000002,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3798 [IDA])): 85.480000000000004},
 'sector': {(Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1251 [CAJ])): True,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(1315 [CBT])): True,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3212 [GILD])): True,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3460 [HAS])): True,
  (Timestamp('2020-04-06 00:00:00+0000', tz='UTC', offset='C'),
   Equity(3798 [IDA])): True}}

kindly post a subset of ur data, and not links or pics. [guide](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — sammywemmy, Apr 08 '20 at 01:03
So I tried following the advice in the post which you linked but the online IDE that im using has blacklisted `pd.read_clipboard(sep='\s\s+') ` Do you have other advice on how I may help you more here? — andres, Apr 08 '20 at 02:26
sammywemmy, thank you for your response. I have updated my original post with the dictionary. — andres, Apr 08 '20 at 03:54

Eric Truett · Accepted Answer · 2020-04-08T02:43:50.477

0

Use a regular expression with str.extract

# I'm not sure of the column name, but it appears to be
# the second column if you reset the index
df.reset_index().iloc[:, 1].str.extract(r'\[([A-Z_]+)\]')

edited Apr 08 '20 at 02:43

answered Apr 08 '20 at 01:10

Eric Truett

2,970
1
16
21

Thank you for your comment, when I followed your advice I received 36 NaN values. Which makes me think if the issue is the iloc? – andres Apr 08 '20 at 02:18
I think that means that there are 36 lines that don't match the regular expression. You could also try ```r'\[(.*?)\]'``` if the modified code still produces NaN values. – Eric Truett Apr 08 '20 at 02:43
that also did not work. I have updated the original post with a dictionary of the first 5 rows, that might help with the dataframe interpretation. This sure is a tricky one. – andres Apr 08 '20 at 03:55
yea, read_clipboard() just failed with this one. not sure what other suggestion there is for multi index. have a look at this [link](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for guidance; there should be a post in there for multi indexes – sammywemmy Apr 08 '20 at 04:25

Extracting indices from Multi-level DataFrame as strings in a list

1 Answers1