In this link, it says that truncated MD5 is uniformly distributed. I wanted to check it using PySpark and I created 1,000,000 UUIDs in Python first as shown below. Then truncated the first three characters from MD5. But the plot I get is not similar to the cumulative distribution function of a uniform distribution. I tried with UUID1 and UUID4 and the results are similar. What is the right way of conforming the uniform distribution of truncated MD5?
import uuid
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
import pandas as pd
import pyspark.sql.functions as f
%matplotlib inline
### Generate 1,000,000 UUID1
uuid1 = [str(uuid.uuid1()) for i in range(1000000)] # make a UUID based on the host ID and current time
uuid1_df = pd.DataFrame({'uuid1':uuid1})
uuid1_spark_df = spark.createDataFrame(uuid1_df)
uuid1_spark_df = uuid1_spark_df.withColumn('hash', f.md5(f.col('uuid1')))\
.withColumn('truncated_hash3', f.substring(f.col('hash'), 1, 3))
count_by_truncated_hash3_uuid1 = uuid1_spark_df.groupBy('truncated_hash3').count()
uuid1_count_list = [row[1] for row in count_by_truncated_hash3_uuid1.collect()]
ecdf = ECDF(np.array(uuid1_count_list))
plt.figure(figsize = (14, 8))
plt.plot(ecdf.x,ecdf.y)
plt.show()
EDIT: I added the histogram. As you can see below, it looks more like normal distribution.
plt.figure(figsize = (14, 8))
plt.hist(uuid1_count_list)
plt.title('Histogram of counts in each truncated hash')
plt.show()