So I have this data:

0            25
1         10527
2          2704
3           297
4          2032
          ...  
255845        6
255846        3
255847        2
255848        0
255849    43960

It starts at index 0 and ends at 255849. I do something like this to plot it:

    import pandas as pd
    import matplotlib.pyplot as plt

    # read_csv already returns a DataFrame, so no extra conversion is needed
    df = pd.read_csv("hot_post_data.csv")

    # x values: the row positions 0 .. 255849
    upvotes = list(range(255850))

    print(df['Upvotes'].size)
    plt.xlabel('number of upvotes')
    plt.ylabel('posts')
    plt.scatter(upvotes, df['Upvotes'])

    plt.show()

and it results in this: [image]

Because there are more than 200k elements, I think it literally plots every single one of them. So I thought of having one dot represent 10k posts, or some other number x like that, with each dot carrying the average value. For example, say I have two dots: for one of them I'd take the posts with upvotes between 10k and 15k, add up their values, and divide that by the number of posts. By doing this I could probably reduce more than 200k dots to 20-30 dots?

How can I do this? Thanks
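For reference, the "one dot per 10k posts" idea could be sketched roughly as below. This is only a minimal sketch: it assumes the `Upvotes` column from the code above, groups consecutive rows into chunks, and averages each chunk. The function name `chunk_means` and its output column names are made up for illustration.

```python
import numpy as np
import pandas as pd


def chunk_means(values: pd.Series, chunk_size: int = 10_000) -> pd.DataFrame:
    """Collapse every `chunk_size` consecutive rows into one averaged point."""
    values = values.reset_index(drop=True)
    # Integer division of the row position: rows 0..9999 -> chunk 0, and so on.
    chunk_id = np.arange(len(values)) // chunk_size
    summary = values.groupby(chunk_id).agg(mean_upvotes="mean", posts="count")
    # Row position where each chunk starts, usable as the x coordinate.
    summary["first_row"] = summary.index * chunk_size
    return summary.reset_index(drop=True)


# Possible usage with the CSV from the question (not executed here):
# df = pd.read_csv("hot_post_data.csv")
# summary = chunk_means(df["Upvotes"])
# plt.scatter(summary["first_row"], summary["mean_upvotes"])
# plt.show()
```

With 255850 rows and a chunk size of 10,000 this would give 26 points, the last one averaging the remaining 5850 rows.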

ash2021
  • It should probably be just `df.reset_index().plot(kind='scatter', x=0, y=1)` where `0` is the name of the index and `1` is the column data based on what is posted. Otherwise use the name of the column that was the index for x and the name of the data column for y. – Trenton McKinney May 25 '22 at 15:52
  • +1 to the above comment. Also, you should think about what the purpose of your visualization is. Maybe a histogram is a better way to represent the data. See https://stackoverflow.com/questions/36585563/weights-for-histogram-in-pandas – scespinoza May 25 '22 at 15:55
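Following the histogram suggestion in the comments, a rough sketch of grouping posts by upvote range (for example 10k to 15k) might look like this. It assumes non-negative values in the `Upvotes` column; the function name `upvote_histogram`, the default `bin_width`, and the output column names are hypothetical choices for illustration, not something from the question.

```python
import numpy as np
import pandas as pd


def upvote_histogram(upvotes: pd.Series, bin_width: int = 5_000) -> pd.DataFrame:
    """Count posts and average upvotes inside fixed-width upvote ranges."""
    # Edges 0, bin_width, 2*bin_width, ... with the last edge above the maximum.
    edges = np.arange(0, int(upvotes.max()) + bin_width + 1, bin_width)
    # right=False makes each range half-open: [0, 5000), [5000, 10000), ...
    ranges = pd.cut(upvotes, bins=edges, right=False)
    grouped = upvotes.groupby(ranges, observed=False)
    # Empty ranges get a count of 0 and a mean of NaN.
    return grouped.agg(posts="count", mean_upvotes="mean")


# Possible usage with the CSV from the question (not executed here):
# df = pd.read_csv("hot_post_data.csv")
# print(upvote_histogram(df["Upvotes"]))
# df["Upvotes"].plot(kind="hist", bins=30, logy=True)  # built-in histogram plot
```

A log-scaled y axis (`logy=True`) is only a suggestion here, since upvote counts are often heavily skewed towards small values.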

0 Answers