Grouped Column Operations in Python using Pandas

Question

I have a data frame consisting of a .csv import that contains n number of trials. Trials are arranged by column with a header (wavelength1 for trial 1, wavelength2 for trial 2 etc.) We're tracking the absorption of a solution over time during a chemical reaction. You can see a SS of the excel file in the link. Trials are grouped in to threes (with g of sugar being the IDV and the absorbance in nm being the DV). For each trial:

I need to determine what the maximum and minimum values are. This can of course be done using max() and min() but when we are sampling every 0.25 seconds, the data can be noisy, meaning that I have to smooth it out. I have already built a function to do that. We're also probably just going to sample every one second as it's much smoother anyway.
Each group of three trials needs to be plotted on the same graph for comparison. n number of trials will create n/3 graphs.

I'm coming from an intermediate background in MATLAB. This is not something I was ever able to figure out in there, either.

What have I done so far?

I have attempted to make a list out of the header for each trial, and then use use a for loop to move through the data using the df.column_name command:

data = pd.read_csv('data.csv')
col_name = data.columns.values
print(col_name)
for i in col_name:
    print(data.col_name[i])

The code works up to the 4th line, where it returns the error: AttributeError: 'DataFrame' object has no attribute 'col_name'. Here is where I would like to make a series or set (whatever it's called here) with all of the values from the wavelength1 trial to plot/manipulate/etc. It's worth noting that I have gotten the multiple plots and multiple lines to work manually: but I want to automate it as that's ofc the point of coding. Here's one out of four graphs of the 'manual' version:

import pandas as pd
import matplotlib.pyplot as plt
#import matplotlib as matplotlib
data = pd.read_csv('data.csv')

plt.close("all")

n_rows = 2
n_columns = 2

#initialize figure
figure_size = (30,15)
font_size = 13
f, ([plt1, plt2], [plt3, plt4]) = plt.subplots(n_rows,n_columns, figsize = figure_size)

#plot first three runs 
x=data.time1
y=data.wavelength1
plt1.plot(x,y, label='Trial 1')
x=data.time2
y=data.wavelength2
plt1.plot(x,y,label='Trial 2')
plt1.set_title('0.3g Glucose', fontweight="bold", size=font_size)
x=data.time3
y=data.wavelength3
plt1.plot(x,y,label='Trial 3')
plt1.set_ylabel('Wavelength (nm)', fontsize = font_size)
plt1.set_xlabel('Time (s)', fontsize = font_size)
plt1.legend(fontsize=font_size)

My first thought was just to do:

for i in range (0,num_col):
    plot(time,data.wavelength(i))

But this does not work. I'm sure it's something quite simple but it is escaping me.

Example data:

https://ufile.io/ac226vma

Thanks in advance! [1]: https://i.stack.imgur.com/gMtBN.png

Interesting idea that Johan had, but it doesn't work either: ``` for i in col_name: print(data[col_name[i]]) ``` returns: "IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices" — a tired engineer, Sep 10 '21 at 22:37

score 1 · Accepted Answer · answered Sep 10 '21 at 22:35

Analysis

I need to determine what the maximum and minimum values are.

Since you want the largest value within each trial, and each trial is represented by one column, you can use DataFrame.min() to get the smallest value in each column. If you want to know the index of the smallest value, you can throw in idxmin() too. Same idea with max.

df = pd.read_csv("data.csv")

# Get max and min values
print("ANALYSIS OF MIN AND MAX VALUES")
analysis_df = pd.DataFrame()
analysis_df["min"] = df.min()
analysis_df["min_idx"] = df.idxmin()
analysis_df["max"] = df.max()
analysis_df["max_idx"] = df.idxmax()
print(analysis_df)

produces:

ANALYSIS OF MIN AND MAX VALUES
                min  min_idx    max  max_idx
wavelength1   801.0      120  888.0        4
wavelength2   809.0       85  888.0        1
wavelength3   728.0       96  837.0        1
wavelength4   762.0      114  864.0        3
wavelength5   785.0      115  878.0        2
wavelength6   747.0      118  866.0        1
wavelength7   748.0      119  851.0        3
wavelength8   776.0      113  880.0        0
wavelength9   812.0      112  900.0        0
wavelength10  770.0      110  863.0        1
wavelength11  759.0      100  858.0        0
wavelength12  787.0       91  876.0        0
wavelength13  756.0       66  862.0        2
wavelength14  809.0       70  877.0        1
wavelength15  828.0       62  866.0        0

Plotting

Each group of three trials needs to be plotted on the same graph for comparison. n number of trials will create n/3 graphs.

This is easier if you break it up into a few smaller subproblems.

First, you want to take a list of all of your columns and break them up into groups of three. I copied the code to do this from here.

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)

Now, once we have a group of three column names, we need to get the values within the dataframe associated with those columns. Also, since your datafile contains unequal numbers of observations per trial, we need to get rid of the NaN's at the end of the file.

def get_trials(df, column_group_names):
    """Get columns from dataframe, dropping missing values."""
    column_group = df[list(column_group_names)]
    column_group = column_group.dropna(how='all')
    return column_group

Now, let's combine those two functions:

col_iterator = grouper(3, df.columns)
[...]
for column_group_names in col_iterator:
    column_group = get_trials(df, column_group_names)
    [...]

This will let us loop over the columns in groups of three, and plot them individually. Since we've filtered it down to the data we're interested in, we can use DataFrame.plot to plot it to the matplotlib plot.

Next, we need to loop over the subplots. This is a little annoying to do while also looping over groups, so I like to define an iterator.

def subplot_axes_iterator(n_rows, n_columns):
    for i in range(n_rows):
        for j in range(n_columns):
            yield i, j

Example of it in use:

>>> list(subplot_axes_iterator(2, 2))
[(0, 0), (0, 1), (1, 0), (1, 1)]

Now, combine those pieces:

# Plot data
n_rows = 2
n_columns = 3
figure_size = (15, 10)
font_size = 13
fig, axes = plt.subplots(n_rows, n_columns, figsize=figure_size)

col_iterator = grouper(3, df.columns)
axes_iterator = subplot_axes_iterator(n_rows, n_columns)
plot_names = [
    "Group 1",
    "Group 2",
    "Group 3",
    "Group 4",
    "Group 5",
]

for column_group_names, axes_position, plot_name in \
        zip(col_iterator, axes_iterator, plot_names):
    print(f"plotting {column_group_names} at {axes_position}")
    column_group = get_trials(df, column_group_names)
    column_group.plot(ax=axes[axes_position])
    axes[axes_position].set_title(plot_name, fontweight="bold", size=font_size)
    axes[axes_position].set_xlabel("Time (s)", fontsize=font_size)
    axes[axes_position].set_ylabel("Wavelength (nm)", fontsize=font_size)
plt.tight_layout()
plt.show()

(By the way, you said that you want 4 graphs, but the dataset posted has fifteen trials, so I made 5 graphs.)

Final script

(Included for easy copy/paste.)

import pandas as pd
import matplotlib.pyplot as plt
import itertools


def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)


def get_trials(df, column_group_names):
    """Get columns from dataframe, dropping missing values."""
    column_group = df[list(column_group_names)]
    column_group = column_group.dropna(how='all')
    return column_group


def subplot_axes_iterator(n_rows, n_columns):
    for i in range(n_rows):
        for j in range(n_columns):
            yield i, j



df = pd.read_csv("data.csv")

# Get max and min values
print("ANALYSIS OF MIN AND MAX VALUES")
analysis_df = pd.DataFrame()
analysis_df["min"] = df.min()
analysis_df["min_idx"] = df.idxmin()
analysis_df["max"] = df.max()
analysis_df["max_idx"] = df.idxmax()
print(analysis_df)


# Plot data
n_rows = 2
n_columns = 3
figure_size = (15, 10)
font_size = 13
fig, axes = plt.subplots(n_rows, n_columns, figsize=figure_size)

col_iterator = grouper(3, df.columns)
axes_iterator = subplot_axes_iterator(n_rows, n_columns)
plot_names = [
    "Group 1",
    "Group 2",
    "Group 3",
    "Group 4",
    "Group 5",
]

for column_group_names, axes_position, plot_name in \
        zip(col_iterator, axes_iterator, plot_names):
    print(f"plotting {column_group_names} at {axes_position}")
    column_group = get_trials(df, column_group_names)
    column_group.plot(ax=axes[axes_position])
    axes[axes_position].set_title(plot_name, fontweight="bold", size=font_size)
    axes[axes_position].set_xlabel("Time (s)", fontsize=font_size)
    axes[axes_position].set_ylabel("Wavelength (nm)", fontsize=font_size)
plt.tight_layout()
plt.show()

Grouped Column Operations in Python using Pandas

1 Answers1

Analysis

Plotting

Final script