I have multiple data frames in this format:

year    count   cum_sum
2001    5   5
2002    15  20
2003    14  34
2004    21  55
2005    44  99
2006    37  136
2007    55  191
2008    69  260
2009    133 393
2010    94  487
2011    133 620
2012    141 761
2013    206 967
2014    243 1210
2015    336 1546
2016    278 1824
2017    285 2109
2018    178 2287

I have generated the following plot: [image]

The following code has been utilized for this purpose:

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(12,8))

sns.pointplot(x="year", y="cum_sum", data=china_papers_by_year_sorted, color='red')
sns.pointplot(x="year", y="cum_sum", data=usa_papers_by_year_sorted, color='blue')
sns.pointplot(x="year", y="cum_sum", data=korea_papers_by_year_sorted, color='lightblue')
sns.pointplot(x="year", y="cum_sum", data=japan_papers_by_year_sorted, color='yellow')
sns.pointplot(x="year", y="cum_sum", data=brazil_papers_by_year_sorted, color='green')

ax.set_ylim([0,2000])
ax.set_ylabel("Cumulative frequency")

fig.text(x = 0.91, y = 0.76, s = "China", color = "red", weight = "bold") #Here I have had to indicate manually x and y coordinates
fig.text(x = 0.91, y = 0.72, s = "South Korea", color = "lightblue", weight = "bold") #Here I have had to indicate manually x and y coordinates

plt.show()

The problem is that fig.text does not use data coordinates, so I have had to set the position of each label by hand (see "China" and "South Korea" above). Is there a cleverer way to do this? I have seen an example that uses the ".last_valid_index()" method, but since the data coordinates are not being recognized, it does not work here.

  • You can use rescaled data coordinates. Just divide the last y-value by the maximum y-value. That will give you the rescaled coordinates – Sheldore Oct 29 '18 at 18:01
  • See [this question](https://stackoverflow.com/questions/49237522/annotate-end-of-lines-using-python-and-matplotlib). – ImportanceOfBeingErnest Oct 29 '18 at 18:44
  • I guess it makes sense to close this as duplicate. @Fernando If you have problems implementing the linked solution for your case, it would make sense to ask a new question about the specific problem, containing a [mcve] of the issue. – ImportanceOfBeingErnest Oct 29 '18 at 20:27
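
The rescaling idea from the first comment can be sketched roughly as follows. This is only an illustration, not code from the thread: the value of last_y is made up, and the sketch assumes the label is placed with ax.text in axes coordinates (transform=ax.transAxes) rather than with fig.text, so that the fraction is measured against the axes and not the whole figure.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))
ax.set_ylim([0, 2000])

# Hypothetical final cum_sum value of one series (not taken from the question)
last_y = 1800

# Rescale the data value to a 0..1 fraction of the visible y-range
y_min, y_max = ax.get_ylim()
y_frac = (last_y - y_min) / (y_max - y_min)

# x=1.01 puts the label just right of the axes; y follows the data value
label = ax.text(1.01, y_frac, "China", transform=ax.transAxes,
                color="red", weight="bold", va="center")
```

If the y-limits change later, the fraction has to be recomputed, which is one reason annotating in data coordinates is usually less fragile.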

1 Answer

You don't need to make repeated calls to pointplot or add the labels manually. Instead, add a country column to each data frame, combine the data frames, and plot the cumulative sum against year with country as the hue:

# Add a country label to dataframe itself
china_papers_by_year_sorted['country'] = 'China'
usa_papers_by_year_sorted['country'] = 'USA'
korea_papers_by_year_sorted['country'] = 'Korea'
japan_papers_by_year_sorted['country'] = 'Japan'
brazil_papers_by_year_sorted['country'] = 'Brazil'

# List of dataframes with same columns
frames = [china_papers_by_year_sorted, usa_papers_by_year_sorted,
          korea_papers_by_year_sorted, japan_papers_by_year_sorted,
          brazil_papers_by_year_sorted]

# Combine into one dataframe
result = pd.concat(frames)

# Plot; hue draws one line per country and labels each in the legend
ax = sns.pointplot(x="year", y="cum_sum", hue="country", data=result)
ax.set_ylim([0,2000])
ax.set_ylabel("Cumulative frequency")
plt.show()

Edit: If you want to annotate the lines themselves instead of using the legend, the answers to this existing question show how to annotate the ends of lines.
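
For illustration, end-of-line labels in data coordinates might look something like the sketch below. It is not taken from the linked answers: the small result frame and the colors dictionary are made-up stand-ins, and it uses plain matplotlib ax.plot instead of sns.pointplot. As far as I can tell, pointplot treats x as categorical, so the x positions are 0, 1, 2, ... rather than the years themselves, which would explain why year-based data coordinates do not line up in the original plot.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the combined dataframe `result` (values made up)
result = pd.DataFrame({
    "year":    [2016, 2017, 2018, 2016, 2017, 2018],
    "cum_sum": [1824, 2109, 2287, 900, 1000, 1100],
    "country": ["China"] * 3 + ["Korea"] * 3,
})
colors = {"China": "red", "Korea": "lightblue"}  # assumed color mapping

fig, ax = plt.subplots(figsize=(12, 8))
labels = {}
for country, df in result.groupby("country"):
    df = df.sort_values("year")
    ax.plot(df["year"], df["cum_sum"], marker="o", color=colors[country])

    # Anchor the label at the last data point, nudged 5 points to the right
    last = df.iloc[-1]
    labels[country] = ax.annotate(
        country,
        xy=(last["year"], last["cum_sum"]),
        xytext=(5, 0), textcoords="offset points",
        va="center", color=colors[country], weight="bold",
    )

ax.set_ylabel("Cumulative frequency")
# plt.show()  # uncomment to display the figure
```

If you keep sns.pointplot, the same annotate call should work with the x value replaced by the index of the last year category (number of years minus one).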

Abhinav Sood