I have two dataframes named merged and initial. The second is a subset of the first. I am plotting a histogram of each column of both dataframes to compare them. The histograms show differences in the values of the second dataframe that should not exist, since it is a subset of the first. To check the column values, I printed them for both dataframes. For the column fragC I get the following values:
[13.01 46.03 12.05 64.08 14.04] and
[13.01 64.08]
As you can see, the second is a subset of the first. When I plot the histogram, I get this:
OPERA is the second dataframe. This is weird: in the plot it looks as if the second dataframe contains values that do not exist in the first one, but that is not true. I am plotting with the code below:
import matplotlib.pyplot as plt

for column in common_columns:
    # Exclude the excluded_columns from the comparison
    if column not in excluded_columns:
        print("")
        our_values = df1[column].values
        opera_values = df2[column].values
        print(column)
        print(our_values)
        print(opera_values)
        # Plot the distribution for df1 and df2
        plt.figure(figsize=(10, 6))
        plt.hist(df1[column], bins=20, alpha=0.5, label='our dataset')
        plt.hist(df2[column], bins=20, alpha=0.5, label='OPERA')
        plt.xlabel('Values')
        plt.ylabel('Frequency')
        plt.title(f'Distribution Comparison for Column: {column}')
        plt.legend()
        plt.tight_layout()
        plt.show()
The dataframes are extremely large, so below I provide only the values of this specific column (fragC):
{0: 13.01, 1: 46.03, 2: 12.05, 3: 64.08, 4: 14.04}
{0: 13.01, 1: 64.08}
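
To make this reproducible, here is a minimal sketch that uses only the fragC values above. It checks the subset relation and prints the bin edges each histogram would get, assuming that an integer `bins` makes the histogram derive its edges from the range of whatever data is passed in that call. The names `is_subset`, `edges1`, `edges2` and `same_edges` are only for illustration and are not part of my original code.

```python
import numpy as np
import pandas as pd

# Column values copied from the dictionaries above
df1 = pd.DataFrame({"fragC": [13.01, 46.03, 12.05, 64.08, 14.04]})
df2 = pd.DataFrame({"fragC": [13.01, 64.08]})

# Check that every value of df2 also occurs in df1
is_subset = bool(df2["fragC"].isin(df1["fragC"]).all())
print("df2 is a subset of df1:", is_subset)

# Bin edges computed independently for each column with bins=20
# (assumption: this mirrors what each plt.hist call does internally)
edges1 = np.histogram_bin_edges(df1["fragC"], bins=20)
edges2 = np.histogram_bin_edges(df2["fragC"], bins=20)
print("df1 edges start/end:", edges1[0], edges1[-1])
print("df2 edges start/end:", edges2[0], edges2[-1])

# Compare the two sets of edges
same_edges = bool(np.allclose(edges1, edges2))
print("same bin edges:", same_edges)
```

If the two sets of edges differ, the bars of the two histograms are not aligned, which might explain why OPERA seems to have bars where the first dataset has none. Passing one shared set of edges to both calls (for example `bins=edges1`) could be a way to test this.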