
I have a dictionary in which each key is the length of a value in a particular column and each value is the number of times that length occurred. How do I draw a boxplot from this data? Do I have to convert it to a list first and then call the plot function?

from collections import defaultdict
sample_data = [1, 2, 3, 1, 1, 2, 4, 5]

sample_dict = defaultdict(int)
for i in sample_data:
    sample_dict[i] += 1

print(sample_dict)

defaultdict(<class 'int'>, {1: 3, 2: 2, 3: 1, 4: 1, 5: 1})

The dictionary above is how my data is currently stored. My dataset is huge, which is why I chose this representation. Is converting the dictionary into a list the way to plot a boxplot, i.e., do I need to build a list that contains each length repeated as many times as it occurs? TIA!

My dataframe looks like this (showing only the head):

|    |        len | num_occurrence | dt                  |
|---:|-----------:|---------------:|:--------------------|
|  0 |        183 |        599   | 2022-11-24 00:00:00 |
|  1 |        176 |       1029   | 2022-12-15 00:00:00 |
|  2 |          2 |         24   | 2022-12-02 00:00:00 |
|  3 |         18 |     449343   | 2022-12-09 00:00:00 |
|  4 |         45 |     640937   | 2022-12-09 00:00:00 |

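Currently I plot it like this:

import matplotlib.pyplot as plt
import seaborn as sns

# One box per date; every row counts once, regardless of num_occurrence
sns.boxplot(x='dt_formatted', y='subd_len', data=subd_pdf)
plt.title('Distribution of length')
plt.xticks(rotation=90)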

But it does not take frequency of occurrence into consideration.
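One way to make the counts matter, sketched here under assumptions: expand each row `num_occurrence` times before plotting. The column names `len`, `num_occurrence`, and `dt` are taken from the table above; the helper `expand_by_frequency` and its `scale` parameter are hypothetical names introduced only for this illustration. Dividing the counts by `scale` keeps the expanded frame small, at the cost of a coarser approximation.

```python
import numpy as np
import pandas as pd


def expand_by_frequency(df, value_col, freq_col, group_col, scale=1):
    """Repeat each row's value roughly freq / scale times (hypothetical helper)."""
    # Integer-divide the counts to shrink the result; clamp to 1 so that rare
    # values do not vanish (this slightly over-weights them).
    reps = np.maximum(df[freq_col].to_numpy() // scale, 1)
    return pd.DataFrame({
        group_col: np.repeat(df[group_col].to_numpy(), reps),
        value_col: np.repeat(df[value_col].to_numpy(), reps),
    })


# The five rows shown in the table above
freq_df = pd.DataFrame({
    "len": [183, 176, 2, 18, 45],
    "num_occurrence": [599, 1029, 24, 449343, 640937],
    "dt": pd.to_datetime(
        ["2022-11-24", "2022-12-15", "2022-12-02", "2022-12-09", "2022-12-09"]
    ),
})

expanded = expand_by_frequency(freq_df, "len", "num_occurrence", "dt", scale=100)
```

The expanded frame can then go to the same plotting call, e.g. `sns.boxplot(x='dt', y='len', data=expanded)`. With `scale=1` the result is exact but may not fit into memory for counts of this size.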

  • From a dictionary you can create a histogram via the `weights` parameter, e.g. `sns.histplot(x=sample_dict.keys(), weights=sample_dict.values(), discrete=True)` – JohanC Dec 20 '22 at 21:58
  • Thanks, I have multiple distributions like this for each date that I need to compare. So I was thinking boxplot could help compare those distributions by date. – hypothesisusable Dec 20 '22 at 22:10
  • For a boxplot, the suggested approach is to use `np.repeat(values, frequencies)` as the input. If the frequencies are too high for the result to fit into memory, they can be divided by some factor. E.g. [How to create a box plot from a frequency table](https://stackoverflow.com/questions/59945084/how-to-create-a-box-plot-from-a-frequency-table) – JohanC Dec 20 '22 at 23:01
  • Thanks, I will try this out! I am guessing `np.repeat` might take up too much memory, so I might have to divide the frequencies by some factor. Thank you! – hypothesisusable Dec 21 '22 at 17:53
  • See also [How to find quantile from frequency data?](https://stackoverflow.com/questions/47947391/how-to-find-quantile-from-frequency-data). If you replace the data with just 5 data points (min, max, median, first and third quartile) you can create a boxplot. – JohanC Dec 21 '22 at 20:11
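The five-number idea from the last comment can be sketched without expanding the data at all. This is a minimal illustration, not a definitive implementation: `freq_quantile` and `box_stats` are hypothetical helper names, the quantile rule used is "smallest value whose cumulative count reaches q × total" (no interpolation, so results can differ slightly from `np.quantile` on the expanded data), the whiskers are simply set to the minimum and maximum, and all counts are assumed to be positive.

```python
import numpy as np


def freq_quantile(values, freqs, q):
    """Quantile q (between 0 and 1) of data given as a frequency table."""
    order = np.argsort(values)
    sorted_values = np.asarray(values)[order]
    cumulative = np.cumsum(np.asarray(freqs)[order])
    # First position whose cumulative count reaches q * total
    idx = np.searchsorted(cumulative, q * cumulative[-1], side="left")
    return sorted_values[min(idx, len(sorted_values) - 1)].item()


def box_stats(freq, label=""):
    """Build the statistics dict for one box from a {value: count} mapping."""
    values = list(freq.keys())
    counts = list(freq.values())
    return {
        "label": label,
        "whislo": min(values),                       # whisker at the minimum
        "q1": freq_quantile(values, counts, 0.25),
        "med": freq_quantile(values, counts, 0.50),
        "q3": freq_quantile(values, counts, 0.75),
        "whishi": max(values),                       # whisker at the maximum
        "fliers": [],                                # no outliers are tracked
    }


# The frequency dictionary from the question
sample_dict = {1: 3, 2: 2, 3: 1, 4: 1, 5: 1}
stats = box_stats(sample_dict, label="sample")
```

A list of such dicts, one per date, can be handed to matplotlib's `Axes.bxp`, for example `ax.bxp([stats], showfliers=False)`, which draws boxes from precomputed statistics. Memory use then depends on the number of distinct lengths rather than on the total count.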
