Comparing rows from a Pandas DataFrame depending on certain attribute

Question

The task is to compare two production lines A and B with respect to a performance indicator called OEE. I suspect that there is an impact by the article and line, so I want to compare only articles that have been produced at least once on both lines (here: hammer, drill, pliers, widget). This may later enable to tell if certain articles perform better on one of both lines.

I have managed by iterarion to mark only articles that have run on both lines at least once. But I wonder: is there not a more elegant way to do it?

Interation is not considered the best way as I found out here: https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758

But perhaps the DataFrame can be modified in a simpler way?

Python 3.9.7
IPython 7.29.0
Pandas 1.3.4

What I did - but it feels "clumsy":

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# create test data

data = {"batch": [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        "production line": ["A", "A", "B", "A", "B", "B", "A", "B", "B", "B"],
        "article": ["hammer", "drill", "hammer", "pliers", "pliers", "pliers", "hammer", "hammer", "hammer", "widget"],
        "OEE": np.random.rand(10)}
df = pd.DataFrame(data, columns = ["batch", "production line", "article","OEE"])

# add a helper column
df["comparable"] = False


# check for each line if the article exists on the other line
for i in range (len(df)):
    production_line = df.loc[i, "production line"]
    article = df.loc[i, "article"]
    k = i
    while k < len(df):
        if production_line != df.loc[k, "production line"] and article == df.loc[k, "article"]:
            df.loc[i, "comparable"] = True
            df.loc[k, "comparable"] = True
        k +=1

# create reduced dataset
reduced = df[df["comparable"] == True]

sns.stripplot(x= reduced["production line"], y=reduced.OEE, data=reduced, size=16, alpha=.2, jitter=True, edgecolor="none")

score 0 · Answer 1 · answered Feb 07 '23 at 01:23

0

you can pivot on article with pivot_table, then drop any rows with some missing data for either line, and unpivot with stack

reduced = (
    df.pivot_table(
        index=["article"],
        columns=["production line"],
        values=["OEE"],
    )
    .dropna()
    .stack()
    .reset_index()
)

reduced:

	article	production line	OEE
0	hammer	A	0.0775703
1	hammer	B	0.328319
2	pliers	A	0.201308
3	pliers	B	0.0547307

answered Feb 07 '23 at 01:23

onepan

946
5
8

Thank you for your efforts, but the result is not as expected. You lose data regarding the hammers. There should be 5 datapoints left at the end. – RogerWilco77 Feb 07 '23 at 21:20

score 0 · Accepted Answer · answered Feb 07 '23 at 01:29

0

You can groupby 'article', and then filter for the number of unique items in 'production line':

reduced = df.groupby("article").filter(lambda x: x["production line"].nunique() == 2)

answered Feb 07 '23 at 01:29

Andreas

8,694
3
14
38

Comparing rows from a Pandas DataFrame depending on certain attribute

2 Answers2