Correct way to access pandas dataframe

Question

I'm trying to access/create a list of module names from the CEC database accessed by pvlib:

import pandas as pd
import pvlib as pv

cecmod = pv.pvsystem.retrieve_sam('CECMod')

I want to search the list of module names:

matching = [s for s in dir(cecmod) if "Trina" in s]

The dir(cecmod) part bothers me. I've stumbled on this way of getting the list of dataframe column headings (keys?) but I feel like dir isn't meant to be used this way. Why does dir(pandas.DataFrame) return this list of column headings instead of a ? Is this the way dataframes are meant to be used? Is there a better way to access these headings/keys?

What are you trying to do here? What is `matching` and what should it contain? — cs95, Jan 20 '18 at 23:59
For those of us who know no `pvlib`, what is the type, structure, and typical value of `cecmod`? Why would it have a list of module names? As a side note, the list of dataframe `df` column names is in `df.columns` and the row names are in `df.index`. — DYZ, Jan 21 '18 at 00:00
cecmod is a pandas dataframe. I'm not sure I can properly explain dataframes, I'm quite confused about them myself. They're a container for objects (pandas series) and seem to be something like a 2d numpy array where each row and column has a title. retrieve_sam() accesses the internet and returns a dataframe which contains thousands of entries, each with its own title. I want a list of those titles in matching. I'm confused because the behaviour of the dataframe returned from pvlib doesn't seem to match the behaviour as described in pandas tutorials. — GlenS, Jan 21 '18 at 00:09
`cecmod` is a DataFrame. It is a pandas DataFrame; not a `pvlib DataFrame` (I don't know if something like that even exists). You probably made a mistake somewhere. It acts like every other pandas DataFrame. — ayhan, Jan 21 '18 at 00:16

score 1 · Accepted Answer · edited Jan 21 '18 at 00:20

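No, this is really bad design. dir(..) is meant to list all attributes of an object, although that is not always possible since some objects generate attributes on the fly. pandas adds column names that are valid Python identifiers to that listing so that attribute-style access and tab completion work, which is why they show up there next to every method of the dataframe.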

It is also a bad idea to filter dir(..) with if "Trina" in s, since the search string may also occur in the name of a method or other attribute, which would then show up as a false match.

One way to obtain the list of columns is simply to use cecmod.columns, which is an Index(..) object:

>>> cecmod.columns
Index(['BEoptCA_Default_Module', 'Example_Module', '1Soltech_1STH_215_P',
       '1Soltech_1STH_220_P', '1Soltech_1STH_225_P', '1Soltech_1STH_230_P',
       '1Soltech_1STH_235_WH', '1Soltech_1STH_240_WH', '1Soltech_1STH_245_WH',
       '1Soltech_1STH_FRL_4H_245_M60_BLK',
       ...
       'Zytech_Solar_ZT275P', 'Zytech_Solar_ZT280P', 'Zytech_Solar_ZT285P',
       'Zytech_Solar_ZT290P', 'Zytech_Solar_ZT295P', 'Zytech_Solar_ZT300P',
       'Zytech_Solar_ZT305P', 'Zytech_Solar_ZT310P', 'Zytech_Solar_ZT315P',
       'Zytech_Solar_ZT320P'],
      dtype='object', length=13953)

The Index is iterable, so we can loop over the column names directly:

matching = [col for col in cecmod.columns if "Trina" in col]

which will yield:

>>> [col for col in cecmod.columns if "Trina" in col]
['Trina_Solar_TSM_165DA01', 'Trina_Solar_TSM_170D', 'Trina_Solar_TSM_170DA01', 'Trina_Solar_TSM_170DA03', 'Trina_Solar_TSM_170PA03', 'Trina_Solar_TSM_175D', 'Trina_Solar_TSM_175DA01', 'Trina_Solar_TSM_175DA03', 'Trina_Solar_TSM_175PA03', 'Trina_Solar_TSM_180D', 'Trina_Solar_TSM_180DA01', 'Trina_Solar_TSM_180DA03', 'Trina_Solar_TSM_180PA03', 'Trina_Solar_TSM_185DA01', 'Trina_Solar_TSM_185DA01A', 'Trina_Solar_TSM_185DA01A_05', 'Trina_Solar_TSM_185DA01A_08', 'Trina_Solar_TSM_185DA03', 'Trina_Solar_TSM_185PA03', 'Trina_Solar_TSM_190DA01A', 'Trina_Solar_TSM_190DA01A_05', 'Trina_Solar_TSM_190DA01A_08', 'Trina_Solar_TSM_190DA03', 'Trina_Solar_TSM_190PA03', 'Trina_Solar_TSM_195DA01A', 'Trina_Solar_TSM_195DA01A_05', 'Trina_Solar_TSM_195DA01A_08', 'Trina_Solar_TSM_200DA01A', 'Trina_Solar_TSM_200DA01A_05', 'Trina_Solar_TSM_200DA01A_08', 'Trina_Solar_TSM_205DA01A', 'Trina_Solar_TSM_205DA01A_05', 'Trina_Solar_TSM_205DA01A_08', 'Trina_Solar_TSM_220DA05', 'Trina_Solar_TSM_220PA05', 'Trina_Solar_TSM_220PA05_05', ...

(output is cut off).
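To see why filtering dir(..) is fragile compared to iterating over the columns, here is a minimal sketch on a small stand-in dataframe. The frame, its values and the row labels are made up for illustration; only the column names are borrowed from the output above.

```python
import pandas as pd

# Hypothetical stand-in for cecmod: module names as columns, parameters as rows.
df = pd.DataFrame(
    {
        "Trina_Solar_TSM_170D": [170.0, 1.28],
        "Zytech_Solar_ZT275P": [275.0, 1.63],
        "1Soltech_1STH_215_P": [215.0, 1.57],
    },
    index=["param_a", "param_b"],  # made-up row labels
)

# False matches: dir() also lists methods such as sum and cumsum,
# although no column name contains "sum".
dir_hits = [s for s in dir(df) if "sum" in s]
column_hits = [c for c in df.columns if "sum" in c]

# Missed columns: a name starting with a digit is not a valid Python
# identifier, so pandas does not expose it through dir().
in_dir = "1Soltech_1STH_215_P" in dir(df)
in_columns = "1Soltech_1STH_215_P" in df.columns

print(dir_hits)
print(column_hits)
print(in_dir, in_columns)
```

So dir(..) can both report names that are not columns and silently drop columns that are there, while the columns Index holds exactly the column labels.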

We can also perform faster matching with .str.contains('Trina') like @DYZ says:

list(cecmod.columns[cecmod.columns.str.contains('Trina')])

Here we let the library do the search work, which will usually outperform Python loops.

Alternatively, use str.startswith, assuming the search string resides at the start of your column names:

list(cecmod.columns[cecmod.columns.str.startswith('Trina')])
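The difference between the two string methods can be sketched on a small hypothetical Index. The name Astro_Trina_X is invented purely to show a match that does not sit at the start of the string; regex=False is passed because the search term is meant as a literal substring, not a pattern.

```python
import pandas as pd

# Hypothetical column labels; "Astro_Trina_X" is made up for illustration.
cols = pd.Index(
    [
        "Example_Module",
        "Trina_Solar_TSM_170D",
        "Zytech_Solar_ZT275P",
        "Astro_Trina_X",
    ]
)

# Boolean mask: True where the label contains the literal substring.
mask = cols.str.contains("Trina", regex=False)
contains_hits = list(cols[mask])

# Stricter variant: only labels that begin with the search string.
startswith_hits = list(cols[cols.str.startswith("Trina")])

print(contains_hits)
print(startswith_hits)
```

str.contains also accepts case=False if the match should ignore case.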

If you want the dataframe columns themselves, and not just the column names, use df.filter (where df is your dataframe, here cecmod):

df.filter(like='Trina')
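As a rough illustration of what filter returns, here is a sketch on a small made-up frame (the values and row labels are placeholders, not real CEC data). like= does a plain substring match on the column labels, while regex= applies a regular expression search.

```python
import pandas as pd

# Hypothetical stand-in for cecmod with placeholder values.
df = pd.DataFrame(
    {
        "Example_Module": [1.0, 2.0],
        "Trina_Solar_TSM_170D": [3.0, 4.0],
        "Trina_Solar_TSM_175D": [5.0, 6.0],
    },
    index=["param_a", "param_b"],  # made-up row labels
)

# Keep every column whose label contains "Trina"; all rows are retained.
trina = df.filter(like="Trina")

# Same idea with a regular expression anchored at the start of the label.
trina_prefix = df.filter(regex=r"^Trina")

print(trina)
```

The result is a new dataframe with the same rows and only the matching columns, so it can be used directly for further computation.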

Python loops can be avoided altogether with `matching=df.columns[df.columns.str.find('Trina')!=-1]`. — DYZ, Jan 21 '18 at 00:06
Thank you. I knew dir wasn't the way to go but as I said above the pvlib dataframes don't seem to behave like the dataframes described in pandas tutorials. — GlenS, Jan 21 '18 at 00:12
Edited your answer a bit, if that's okay. Feel free to rollback. — cs95, Jan 21 '18 at 00:21
@GlenS pvlib creates pandas dataframes from csv files. There is no such thing as a "pvlib dataframe". — Will Holmgren, Jan 21 '18 at 00:39
Note that the CEC modules DataFrame created by pvlib transposes rows and columns from the original csv file. Searching the module data seems easier to me after I transpose the DataFrame back to having module names in the row index and parameter names in the column index: `cecmod = cecmod.T` — adr, Jan 24 '18 at 18:42
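The transposed layout from the last comment can be sketched as follows, again on a small made-up frame rather than the real database. After .T the module names form the row index, so the same string methods apply to modules.index instead of the columns.

```python
import pandas as pd

# Hypothetical frame in the layout pvlib returns: modules as columns.
cecmod_like = pd.DataFrame(
    {
        "Example_Module": [1.0, 2.0],
        "Trina_Solar_TSM_170D": [3.0, 4.0],
        "Zytech_Solar_ZT275P": [5.0, 6.0],
    },
    index=["param_a", "param_b"],  # made-up parameter names
)

# Transpose: one row per module, one column per parameter.
modules = cecmod_like.T

# Select the rows whose module name contains the search string.
mask = modules.index.str.contains("Trina", regex=False)
trina_rows = modules.loc[mask]

print(trina_rows)
```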
