
I have a dataframe with values representing one item's correspondence with another item (as a percentage), for example, the number of characters in one string that match another string. Here is some sample data:

**pident**
100
100
51.515
55.405
20
91.667
86.207
58.621
77.778

I would like to show this in a cumulative histogram of the number of matching items. Because the values are percentages, each bin needs to include everything above it rather than everything below it, which is the opposite of what the cumulative option does by default. For example, the bin at 90 percent should contain all of the items with a value of 90 or above.

To work around the fact that the accumulation runs from the lower values to the higher values, I manipulated the values themselves:

df1['abs_pident'] = np.abs(100 - df['pident'])
sns.histplot(data=df1, x='abs_pident', cumulative=True, bins=10)

This way the bars are correct, but the x-axis values are wrong, and this is where I am stuck. The values need to go down from 100 (highest) to 20 (lowest), and I can't find a way to do that. Any idea of how to do that, or how to create the chart the way I want without having to manipulate the data, would be highly appreciated :)

This is what my chart looks like right now:

Chart

Quang Hoang

2 Answers


You can use a formatter to reverse the 100-pident operation and invert the axis:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr  # classes for tick locating and formatting

df1 = pd.DataFrame()
df1["pident"] = [100, 100, 51.515, 55.405, 20, 91.667, 86.207, 58.621, 77.778]
df1["abs_pident"] = np.abs(100 - df1["pident"])
sns.histplot(data=df1, x="abs_pident", cumulative=True, bins=10)

def numfmt(x, pos):
    # Custom formatter: map the plotted value back to pident (100 - x)
    return "{}".format(100 - x)

xfmt = tkr.FuncFormatter(numfmt)  # wrap the function as a tick formatter

plt.gca().xaxis.set_major_formatter(xfmt)
plt.gca().invert_xaxis()
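
If manipulating the data is undesirable, matplotlib's own hist accepts a negative cumulative value, which reverses the direction of accumulation. The following is only a minimal sketch, assuming plain matplotlib (rather than seaborn) is acceptable; it uses the sample values from the question, and the bin edges at multiples of 10 are an assumption:

```python
import matplotlib.pyplot as plt

# Sample values from the question
pident = [100, 100, 51.515, 55.405, 20, 91.667, 86.207, 58.621, 77.778]

fig, ax = plt.subplots()

# cumulative=-1 accumulates from the highest bin down to the lowest,
# so each bar counts the items in its own bin and in every bin above it.
counts, edges, _ = ax.hist(pident, bins=list(range(0, 110, 10)), cumulative=-1)

ax.set_xlabel("pident")
ax.set_ylabel("Count")
```

With these values the leftmost bar has height 9 (all items) and the bar for 90 to 100 has height 3. Call plt.show() or fig.savefig(...) to view the result; the x axis then shows the original pident values, so no formatter is needed.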


Note

Not sure if it is intended, but it might be better to set the bin edges explicitly instead of the bin count. Setting bins=range(0,110,10) gives bin edges at multiples of 10 up to 100 (note how the bin edges in the plot above fall at values that cannot be read directly from the plot).
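
To make the difference concrete, here is a small sketch that compares the two sets of bin edges on the sample data. It assumes that an integer bin count is turned into equal-width edges over the data range, as numpy.histogram_bin_edges does; the variable names are illustrative:

```python
import numpy as np

# Sample values from the question and the transformed column
pident = np.array([100, 100, 51.515, 55.405, 20, 91.667, 86.207, 58.621, 77.778])
abs_pident = np.abs(100 - pident)

# bins=10 splits the data range (0 to 80 here) into 10 equal-width bins
auto_edges = np.histogram_bin_edges(abs_pident, bins=10)

# Explicit edges at multiples of 10
explicit_edges = np.arange(0, 110, 10)

# Express both sets of edges as pident values (reverse the 100 - pident step)
print(100 - auto_edges)
print(100 - explicit_edges)
```

With bins=10 the edges fall at pident values 100, 92, 84, and so on down to 20, while the explicit edges fall at 100, 90, 80, and so on down to 0.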

FlyingTeller

This is how I resolved the issue with the assistance of the answer given here:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr  # has classes for tick-locating and -formatting

def numfmt(x, pos):  # custom formatter: map the plotted value back to pident
    return str(int(100 - x))  # integer tick labels

def main():
    datafile = "Data.csv"
    df = pd.read_csv(datafile)
    df1 = df[['pident']].copy()
    df1['abs_pident'] = np.abs(100 - df['pident'])
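    ax = sns.histplot(data=df1, x='abs_pident', cumulative=True, bins=10)

    xfmt = tkr.FuncFormatter(numfmt)  # wrap the function as a tick formatter

    plt.gca().xaxis.set_major_formatter(xfmt)
    # invert_xaxis() toggles the direction, and plt.gca() is the same axes as ax,
    # so these two calls cancel out: the labels decrease from left to right.
    plt.gca().invert_xaxis()
    ax.invert_xaxis()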

    plt.show()

main()
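
One thing worth checking in the code above: plt.gca() and ax refer to the same axes, so invert_xaxis() is called twice. Since invert_xaxis() flips the current direction rather than setting a fixed one, the second call undoes the first. A minimal check on an empty figure, independent of the data:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Record the axis orientation before and after each call
states = [bool(ax.xaxis_inverted())]      # default orientation

ax.invert_xaxis()
states.append(bool(ax.xaxis_inverted()))  # after one call

plt.gca().invert_xaxis()                  # same axes as ax
states.append(bool(ax.xaxis_inverted()))  # after the second call

print(states)
```

If the goal is labels that run from 100 on the left down to 20 on the right, the formatter alone is enough; if the goal is the usual increasing axis, a single invert_xaxis() call is needed.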

And this is what the chart looks like: (chart image)