
I'm new to Python and need some help plotting data in the following format (please see the picture below).

I will have a file format like this:

# of IDs \t start_time \t end_time
   428      1404238888      1404314624
   132      1404259731      1404346488
    77      1404347808      1404437873 
    63      1404432707      1404520913
    281     1404518967      1404605334
   .......

Based on the recommendations in the comments, I found a way to reduce the data by clustering IDs by start and end time. My new file has the format above, where the first column tells how many IDs fall in that time frame (from start to end). So I guess a better graphical representation for this case is a bar chart.

My y-axis will be the number of IDs and my x-axis will be time in days (my total measurement time is ~3 months).

What I want to show is the time frame in which the most IDs are clustered. What I want to achieve is something like the image below, where I draw one bar for each row in my file.

enter image description here

I hope the image above explains well what I want to achieve. It would be great if you could let me know how to start the graph and set the y-axis and x-axis in the units I want. Sorry, this is the first time I have tried to graph something in Python. I have other code written for my project, and I got stuck writing the code to graph my end result.

Thank you in advance for any help

LKT
  • 3
    A small sample of input numbers, your code, where exactly you get stuck, etc. would all be a welcome addition to this question. I doubt people will write your program for you. As it looks now, all you have to do is parse your file for values and call `plt.plot((start, end), (id, id))` in a `for` loop. – ljetibo Mar 01 '15 at 01:57
  • 3
    You expect to be able to meaningfully view a chart containing on the order of 100 million horizontal lines? Maybe you should consider reducing the data before plotting it! – Hugh Bothwell Mar 01 '15 at 03:03
  • @ljetibo I edited question with sample data. Sorry to not include any code since I don't have it yet for graphing. I don't ask someone to write the program for me. It would greatly help to point me to the right direction and I can get started – LKT Mar 02 '15 at 09:03
  • @HughBothwell Thank you for your suggestion. I have found a way to reduce the data size (please see the edited question). – LKT Mar 02 '15 at 09:04
  • I've had a similar problem to this before and I ended up using `vline` and `hline` to do the job. However, I only had a few data points to worry about then. How many data points do you have? – WGS Mar 02 '15 at 09:12

1 Answer


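It's really, really simple. You should not have had any problems if you had dug a bit in the examples section of the matplotlib documentation: plt.bar(left_edge, height, width) does exactly what you want. (In matplotlib 2.0 and later, bars are centered on the first argument by default, so pass align="edge" to treat it as the left edge.)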

  1. Get what you need.

    import matplotlib.pyplot as plt
    import csv
    

    If your data really is a tab-separated file, it should look like this (yours looks more like it is separated by multiple spaces, to be honest):

    id  start   end
    428 1404238888  1404314624
    132 1404259731  1404346488
    77  1404347808  1404437873 
    63  1404432707  1404520913
    281 1404518967  1404605334
    
  2. Read in the data you have.

    ids = []    # three lists to hold the parsed columns
    start = []
    end = []    # despite the name, this holds bar widths (end - start)
    # The with-block closes the file once reading is done.
    with open("test.txt", "r", newline="") as file:
        reader = csv.DictReader(file, delimiter="\t")
        for data in reader:
            ids.append(float(data["id"]))
            start.append(float(data["start"]))
            # remember: bar() wants a "width", not a "right edge coordinate"
            end.append(float(data["end"]) - float(data["start"]))
    
  3. This is the actual plotting.

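    fig, ax = plt.subplots()
    w = sum(end)/len(end)/10  # alternative: one fixed bar width (not used below)
    for i in range(len(ids)):
        # align="edge" puts the left edge at start[i]; matplotlib 2.0+ centers bars by default
        ax.bar(start[i], ids[i], width=end[i], align="edge")
    
    plt.show()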
    

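If the file turns out to be separated by runs of spaces rather than tabs, as the sample in the question suggests, `csv.DictReader` with a tab delimiter will not split it. A minimal sketch that reads it with `str.split()` instead follows; the function name `read_rows` and the rule of skipping any line that is not three integers (header, `.......` filler) are assumptions, not part of the original code.

```python
import io


def read_rows(lines):
    """Parse whitespace-separated 'count start end' lines into three lists."""
    ids, start, width = [], [], []
    for line in lines:
        parts = line.split()  # splits on any run of spaces or tabs
        # Skip blank lines, the header, and anything that is not three integers.
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            continue
        count, begin, finish = (int(p) for p in parts)
        ids.append(count)
        start.append(begin)
        width.append(finish - begin)  # bar width, not the right edge
    return ids, start, width


# Sample rows copied from the question, standing in for an open file.
sample = io.StringIO(
    "# of IDs  start_time  end_time\n"
    "   428      1404238888      1404314624\n"
    "   132      1404259731      1404346488\n"
    "    77      1404347808      1404437873 \n"
    "    63      1404432707      1404520913\n"
    "   281     1404518967      1404605334\n"
    "   .......\n"
)
ids, start, width = read_rows(sample)
```

With a real file, the same function can be called as `read_rows(open_file)` inside a `with open(...)` block, since iterating over a file yields its lines.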
Since your question says it's important that the right edge of each bar lands on the second coordinate, it's better to plot with width=end[i] than with a fixed width such as w. However, as the graph shows, you have some overlap issues: the first bar ends at ...314... while the second one starts at ...259..., and that's not the only case.
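To see where the bars collide before plotting, the rows can be checked pairwise. This is only a sketch under a simplifying assumption: rows are sorted by start time and each row is compared with the next one, so an interval that spans several later rows would be reported only once. The name `find_overlaps` is made up for illustration.

```python
# Sample rows from the question: (count, start, end) in epoch seconds.
rows = [
    (428, 1404238888, 1404314624),
    (132, 1404259731, 1404346488),
    (77, 1404347808, 1404437873),
    (63, 1404432707, 1404520913),
    (281, 1404518967, 1404605334),
]


def find_overlaps(rows):
    """Return (earlier_row, later_row, overlap_seconds) for consecutive rows that overlap."""
    ordered = sorted(rows, key=lambda r: r[1])  # sort by start time
    overlaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        overlap = earlier[2] - later[1]  # earlier end minus later start
        if overlap > 0:
            overlaps.append((earlier, later, overlap))
    return overlaps


overlaps = find_overlaps(rows)
for earlier, later, seconds in overlaps:
    print(f"{earlier[1]}-{earlier[2]} overlaps {later[1]}-{later[2]} by {seconds} s")
```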

What you're essentially asking for shows that something is wrong, though: "I want to make each line in my file into a bar, I have already stacked the y-axis height, and the x-axis is dates." Apparently the stacking has not been done right, because a histogram like that should have no overlaps. Where two bars overlap, the overlapping part should have been added to the height of the preceding bin.
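One way to get non-overlapping bars is to sweep over the start and end times and keep a running total of IDs. The sketch below is not the asker's clustering method; it assumes that the IDs counted in overlapping rows are distinct, so their counts can simply be added. The name `stack_intervals` is hypothetical.

```python
# Sample rows from the question: (count, start, end) in epoch seconds.
rows = [
    (428, 1404238888, 1404314624),
    (132, 1404259731, 1404346488),
    (77, 1404347808, 1404437873),
    (63, 1404432707, 1404520913),
    (281, 1404518967, 1404605334),
]


def stack_intervals(rows):
    """Return non-overlapping (seg_start, seg_end, total_ids) segments."""
    events = []
    for count, start, end in rows:
        events.append((start, count))   # IDs become active
        events.append((end, -count))    # IDs stop being active
    events.sort()

    segments = []
    active = 0          # total IDs active just before the current event
    prev_time = None
    for time, delta in events:
        # Close the segment that ran from the previous event to this one.
        if prev_time is not None and time > prev_time and active > 0:
            segments.append((prev_time, time, active))
        active += delta
        prev_time = time
    return segments


segments = stack_intervals(rows)
```

Each segment can then be drawn with `ax.bar(seg_start, total_ids, width=seg_end - seg_start, align="edge")`, which gives a step-like chart without overlapping bars.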

I answered a similar question a while back on how to properly handle and stack dates in matplotlib; reading it might help you out. It was done on a list of mock datetime objects. Yours look like raw numeric timestamps (Unix epoch seconds, judging by their size) rather than datetime objects, but the same principles apply, as does the recommendation that you use the hist function and let it take care of the dates.
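As a rough sketch of the date handling, assuming the start and end columns are Unix epoch seconds (an assumption; the sample values are consistent with dates in July 2014), the timestamps can be converted to datetime objects and the bars drawn on a date axis. The helper names `to_utc_datetime` and `plot_date_bars` are made up for illustration, and the month/year tick format is just one choice.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def to_utc_datetime(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def plot_date_bars(rows):
    """Draw one bar per (count, start, end) row on a date-formatted x-axis."""
    # Imported here so the conversion helper works without matplotlib installed.
    import matplotlib.pyplot as plt
    import matplotlib.dates as mdates

    fig, ax = plt.subplots()
    for count, start, end in rows:
        left = mdates.date2num(to_utc_datetime(start))  # matplotlib dates are in days
        width = (end - start) / SECONDS_PER_DAY         # bar width in days
        ax.bar(left, count, width=width, align="edge")
    ax.xaxis.set_major_locator(mdates.MonthLocator())            # one tick per month
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%m/%Y"))  # e.g. 07/2014
    fig.autofmt_xdate()
    plt.show()


# Sample rows from the question: (count, start, end) in epoch seconds.
rows = [
    (428, 1404238888, 1404314624),
    (132, 1404259731, 1404346488),
    (77, 1404347808, 1404437873),
    (63, 1404432707, 1404520913),
    (281, 1404518967, 1404605334),
]
first_start = to_utc_datetime(rows[0][1])
```

Calling `plot_date_bars(rows)` shows the chart. The five sample rows cover only a few days, so monthly ticks are sparse there; `mdates.DayLocator()` may suit a short range better, while `MonthLocator` fits the roughly three months of data mentioned in the question.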

Result (python 3, win7, matplotlib 1.3.1):

enter image description here

ljetibo
  • Perfect! Thanks a lot. This is a great start for me, and I will modify my previous code to get correct output to plot. One question though: is it possible to show the x-axis as dates in mmyyyy format instead of epoch time? – LKT Mar 02 '15 at 22:11
  • See the other answer I linked in my response. Tweak these lines: `ax.xaxis.set_major_locator(mpl.dates.MonthLocator())` (puts markers where the months are), `fmt = mpl.dates.DateFormatter('%m/%d')` (defines what the markers should look like; see other options like `%m%y` etc.), `ax.xaxis.set_major_formatter(fmt)` (instructs the x-axis to use the user-defined format for text display). That should give you what you want. It's better to always try to plot using `datetime` objects, because that saves the user a lot of background calculations such as leap years. – ljetibo Mar 02 '15 at 22:16