
I have two dataframes named merged and initial. The second is a subset of the first. I am plotting a histogram of each column of both datasets to compare them. The histograms show values in the second dataframe that shouldn't exist, since it is a subset of the first. To check the column values, I printed them for both dataframes. For the column fragC I have the values [13.01 46.03 12.05 64.08 14.04] and [13.01 64.08]. As you can see, the second is a subset of the first. When I plot the histograms, I get this: [image]

OPERA is the second dataframe. This is strange: the plot makes it look as if the second dataframe contains values that do not exist in the first, but that is not true. I am plotting with the code below.

for column in common_columns:
    # Exclude the excluded_columns from the comparison
    if column not in excluded_columns:
        print("")
        our_values = df1[column].values
        opera_values = df2[column].values
        print(column)
        print(our_values)
        print(opera_values)
        # Plot the distribution for df1 and df2
        plt.figure(figsize=(10, 6))
        plt.hist(df1[column], bins=20, alpha=0.5, label='our dataset')
        plt.hist(df2[column], bins=20, alpha=0.5, label='OPERA')
        plt.xlabel('Values')
        plt.ylabel('Frequency')
        plt.title(f'Distribution Comparison for Column: {column}')
        plt.legend()
        plt.tight_layout()
        plt.show()

The dataframes have a very large number of columns, so below I provide only the specific column:

{0: 13.01, 1: 46.03, 2: 12.05, 3: 64.08, 4: 14.04}
{0: 13.01, 1: 64.08}
C.D.
  • This question is not reproducible without **df1** and **df2**. This question needs a [SSCCE](http://sscce.org/). Please see [How to provide a reproducible dataframe](https://stackoverflow.com/q/52413246/7758804), then **[edit] your question**, and paste the clipboard into a code block. Always provide a [mre] **with code, data, errors, current output, and expected output, as [formatted text](https://stackoverflow.com/help/formatting)**. If relevant, plot images are okay. If you don't include a mre, it is likely the question will be downvoted, closed, and deleted. – Trenton McKinney May 16 '23 at 01:05
  • Unless it's exactly the same data, why do you expect the bins to be the same? Also, you're plotting two DataFrames on top of each other. – Trenton McKinney May 16 '23 at 01:11
  • Thank you Trenton. I know that I am plotting two dataframes on top of each other; that is exactly what I want to do, as I want to compare the variance of the values of each column for both dataframes. Could you please explain what you mean by asking "why do I expect the bins to be the same"? Thank you – C.D. May 16 '23 at 01:11

1 Answer


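The reason is that the bin edges are different. With bins=20 and no range given, each hist call splits the range of its own data into 20 equal-width bins. The first dataset gets 20 bins running from 12.05 to 64.08 (each about 2.60 wide). The second dataset gets 20 bins running from 13.01 to 64.08 (each about 2.55 wide). The bars therefore do not line up, which makes it look as though the second dataset has values the first one lacks.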
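To see the mismatch without plotting, here is a minimal sketch that computes the default edges for the fragC values quoted in the question. It assumes that, for a single dataset with an integer bins and no range, plt.hist derives its edges the same way as NumPy's np.histogram_bin_edges; the variable names are only illustrative.

```python
import numpy as np

# fragC values copied from the question
our_values = np.array([13.01, 46.03, 12.05, 64.08, 14.04])
opera_values = np.array([13.01, 64.08])

# With an integer `bins` and no `range`, the edges span each dataset's own min..max
our_edges = np.histogram_bin_edges(our_values, bins=20)
opera_edges = np.histogram_bin_edges(opera_values, bins=20)

# First edge and bin width differ between the two datasets
print("our dataset:", our_edges[0], np.diff(our_edges)[0])
print("OPERA:      ", opera_edges[0], np.diff(opera_edges)[0])
```

Both arrays have 21 edges, but only the last edge (64.08) coincides, so the two sets of bars are drawn at slightly different positions.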

If you want both histograms to use the same bins (for example, bins that start at 0), you need to specify that explicitly, with range or bins.

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html
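As a rough illustration of the bins approach, the sketch below computes one set of edges from the combined data and reuses it for both histograms. The helper names shared_bin_edges and plot_comparison and the value_range argument are hypothetical, not part of matplotlib or of the question's code. The fragC values are the ones from the question.

```python
import numpy as np

def shared_bin_edges(*datasets, bins=20, value_range=None):
    """Return one set of bin edges covering all datasets (hypothetical helper)."""
    combined = np.concatenate([np.asarray(d, dtype=float) for d in datasets])
    # value_range=None spans the combined min..max; pass e.g. (0, combined.max())
    # to make the bins start at 0
    return np.histogram_bin_edges(combined, bins=bins, range=value_range)

def plot_comparison(our_values, opera_values, column, bins=20):
    """Overlay both histograms using identical bin edges (hypothetical helper)."""
    import matplotlib.pyplot as plt
    edges = shared_bin_edges(our_values, opera_values, bins=bins)
    plt.figure(figsize=(10, 6))
    plt.hist(our_values, bins=edges, alpha=0.5, label='our dataset')
    plt.hist(opera_values, bins=edges, alpha=0.5, label='OPERA')
    plt.xlabel('Values')
    plt.ylabel('Frequency')
    plt.title(f'Distribution Comparison for Column: {column}')
    plt.legend()
    plt.tight_layout()
    plt.show()

# fragC values copied from the question
our_values = [13.01, 46.03, 12.05, 64.08, 14.04]
opera_values = [13.01, 64.08]

edges = shared_bin_edges(our_values, opera_values)
our_counts, _ = np.histogram(our_values, bins=edges)
opera_counts, _ = np.histogram(opera_values, bins=edges)

# With shared edges, a subset can never exceed the full dataset in any bin
print(np.all(opera_counts <= our_counts))

# To draw the figure: plot_comparison(our_values, opera_values, "fragC")
```

Passing range=(low, high) with the same bins count to both plt.hist calls should give the same alignment; precomputed edges just make the shared binning explicit.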

Tim Roberts
  • Thank you very much for your response, Tim. It makes sense. Could you please add how I can set both graphs to start at 0? – C.D. May 17 '23 at 11:42
  • Did you check the documentation? The `hist` function has a `bins` parameter that lets you specify the bin edges directly, and a `range` parameter that lets you specify the range. It even explains the default. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html – Tim Roberts May 17 '23 at 19:02