Python histogram from unsorted data

Question

I am trying to parse some data to generate a histogram.

The data is in multiple columns, but the only relevant columns for me are the two below.

X

AB    42
CD    77
AB    33
AB    42
AB    33
CD    54
AB    33

Only for the rows with AB, I want to plot the histogram of the value in col 2. So the histogram should sort and plot:

33 - 3

42 - 2

(even though 42 occurs first, I want to plot 33 first).

The file has a lot of columns, but the script needs to match the string 'AB' and only look at those rows. Can anyone help?

UPDATE: Data is in a csv file and there are several columns.

EDIT: I now have the data in a csv file in this format.

Addresses,Data
FromAP,42
FromAP,33
ToAP,77
FromAP,54
FromAP,42
FromAP,33
ToAP,42
FromAP,42
FromAP,33

If I use the code from @dranxo,

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv', sep=',')

df_useful = df[df['Addresses'] == 'FromAP']

df_useful.hist()
plt.show()

I get the following error:

Laptop@ubuntu:~/temp$ ./a.py
/usr/lib/pymodules/python2.7/matplotlib/axes.py:8261: UserWarning: 2D hist input should be nsamples x nvariables;
 this looks transposed (shape is 0 x 1)
  'this looks transposed (shape is %d x %d)' % x.shape[::-1])
Traceback (most recent call last):
  File "./a.py", line 11, in <module>
    df_useful.hist()
   File "/usr/lib/python2.7/dist-packages/pandas/tools/plotting.py", line 2075, in hist_frame
    ax.hist(data[col].dropna().values, **kwds)
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 8312, in hist
    xmin = min(xmin, xi.min())
  File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 21, in _amin
    out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation minimum which has no identity

I do have pandas, numpy, and matplotlib installed. Thanks.

Yes. In a csv or a tab delimited file. And it is a large file... about 100K entries. — mane, Oct 08 '14 at 21:06
@mane Please update your question to specify that the data is in CSV. Makes it easier for future readers than parsing your comment. It also helps to mention in the question what you have tried so far. http://stackoverflow.com/help/how-to-ask — Jay, Oct 08 '14 at 21:18

score 2 · Answer 1 · edited May 23 '17 at 12:05

The following code sample will work. Please note that reading the CSV may be a little different in your situation, depending on its exact format. See this question for reading a CSV.

import csv

with open("/tmp/test.csv", "r") as f:
    # Filter the rows for "AB" as we read them from the file (skip blank rows)
    filtered_result = [tuple(line) for line in csv.reader(f) if line and line[0] == "AB"]

# Now sort the result by the second column, compared as a number rather than as text
final_result = sorted(filtered_result, key=lambda x: int(x[1]))

# Print it for inspection
for key, value in final_result:
    print("key: %s, value: %s" % (key, value))

Output:

key: AB, value: 33
key: AB, value: 33
key: AB, value: 33
key: AB, value: 42
key: AB, value: 42

Contents of /tmp/test.csv:

AB,42
CD,77
AB,33
AB,42
AB,33
CD,54
AB,33

I populated /tmp/test.csv with 100,000 lines of random data, and here is how long my script takes:

$ time python test.py 

real    0m0.073s
user    0m0.073s
sys 0m0.000s

Edit: Updated for better performance and to show example of CSV
Edit: Updated again to be even faster

score 1 · Answer 2 · answered Oct 08 '14 at 21:19

There are two different problems:

Parse CSV - Python has an inbuilt library for CSV.
Graph your results - Does your Python program need to generate the histogram? Or is it acceptable to put your parsed CSV into some spreadsheet software and do it there?

If you have to have your Python program generate the histogram, then here's a list of graphing libraries to get you started.
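
As a rough illustration of the first step (parsing and counting), here is a minimal sketch that uses only the standard library and prints a text histogram instead of calling a graphing library. It is written for Python 3. The names `count_values` and `text_histogram` are made up for this example, and the sample rows are copied from the question; a real script would pass an open file object instead of `io.StringIO`.

```python
import csv
import io
from collections import Counter

# Sample rows copied from the question; a real script would open the CSV file instead.
SAMPLE = "AB,42\nCD,77\nAB,33\nAB,42\nAB,33\nCD,54\nAB,33\n"


def count_values(lines, label="AB"):
    """Count second-column values for rows whose first column equals `label`."""
    counts = Counter()
    for row in csv.reader(lines):
        # Skip blank or short rows, and ignore stray whitespace around the label.
        if len(row) >= 2 and row[0].strip() == label:
            counts[int(row[1])] += 1
    return counts


def text_histogram(counts):
    """Return one line per value, ordered by value, with a bar of '#' marks."""
    return ["%d - %d %s" % (value, n, "#" * n) for value, n in sorted(counts.items())]


counts = count_values(io.StringIO(SAMPLE))
for line in text_histogram(counts):
    print(line)
```

The sorted `(value, count)` pairs from `counts` could then be handed to whichever plotting library is chosen for the second step.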

Hasan Ramezani · Answer 3 · 2014-10-08T21:42:11.793

1

I assume the data is in file.csv, with AB in the first column and 42 in the second column.

import csv

dic = {}
with open('file.csv', 'r') as f:
    for row in csv.reader(f):
        # Only count rows whose first column is 'AB' (skip blank rows)
        if row and row[0] == 'AB':
            value = int(row[1])
            dic[value] = dic.get(value, 0) + 1

# Print the counts sorted by value
for key in sorted(dic):
    print('%s-%s' % (key, dic[key]))

edited Oct 08 '14 at 21:42

answered Oct 08 '14 at 21:20


Let's say I have several columns and I am only interested in the two whose column headers are, say, 'Sender' and 'Value'. How do I extract only these two values into the dict? – mane Oct 08 '14 at 22:13
You can access a column by its number, like list items, counting from `zero`. – Hasan Ramezani Oct 09 '14 at 09:47
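
If the file has a header row, another option besides column numbers is `csv.DictReader`, which lets each row be indexed by header name. The following is only a sketch in Python 3: the 'Sender' and 'Value' headers come from the comment above, while the 'Channel' column, the sample rows, and the function name `count_by_header` are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical file contents: the 'Sender' and 'Value' headers come from the
# comment above, while the 'Channel' column and all rows are invented.
SAMPLE = (
    "Sender,Channel,Value\n"
    "AB,1,42\n"
    "CD,6,77\n"
    "AB,1,33\n"
    "AB,11,42\n"
    "AB,6,33\n"
)


def count_by_header(lines, key_col="Sender", value_col="Value", wanted="AB"):
    """Count `value_col` entries for rows where `key_col` equals `wanted`."""
    counts = Counter()
    # DictReader uses the first row as field names, so other columns are ignored.
    for row in csv.DictReader(lines):
        if row[key_col] == wanted:
            counts[int(row[value_col])] += 1
    return counts


counts = count_by_header(io.StringIO(SAMPLE))
for value in sorted(counts):
    print("%s-%s" % (value, counts[value]))
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by a file object opened on the CSV.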

dranxo · Accepted Answer · 2014-10-10T18:16:17.197

1

Have you ever looked into pandas?

Here's how to parse the data and plot in a few lines:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.ssv', sep=' ')

df_useful = df[df['letters'] == 'AB']

df_useful.hist()
plt.show()


Note: I saved your data into a file called 'data.ssv' before calling pd.read_csv. Here's that file:

letters numbers
AB 42
CD 77
AB 33
AB 42
AB 33
CD 54
AB 33

edit: To check that the problem isn't with the data you can run this code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.round(np.random.randn(10, 2)),
                 columns=['a', 'b'])

df.hist()
plt.show()
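
The warning in the question reports a shape of 0 x 1, which suggests that no rows survived the filter. The thread does not confirm the cause, so the following is only a diagnostic sketch in Python 3 under two assumptions: that stray whitespace around the labels or headers might defeat the comparison, and that the value column might have been read as text. It uses the data from the question's edit through `io.StringIO` in place of 'data.csv', and counts occurrences per value without plotting.

```python
import io

import pandas as pd

# Data copied from the question's edit; io.StringIO stands in for 'data.csv'.
SAMPLE = """Addresses,Data
FromAP,42
FromAP,33
ToAP,77
FromAP,54
FromAP,42
FromAP,33
ToAP,42
FromAP,42
FromAP,33
"""

df = pd.read_csv(io.StringIO(SAMPLE), skipinitialspace=True)

# Strip stray whitespace from the headers and the label column so that the
# equality test below is not defeated by values such as ' FromAP'.
df.columns = df.columns.str.strip()
df["Addresses"] = df["Addresses"].astype(str).str.strip()

# Coerce the value column to numbers; anything unparseable becomes NaN.
df["Data"] = pd.to_numeric(df["Data"], errors="coerce")

df_useful = df[df["Addresses"] == "FromAP"]
print(len(df_useful), "matching rows")  # 0 here would explain an empty histogram

# Occurrences per value, ordered by value rather than by first appearance.
counts = df_useful["Data"].value_counts().sort_index()
print(counts)
```

If the row count is non-zero and the column is numeric, calling `df_useful.hist()` followed by `plt.show()` as in the code above should have data to plot.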

edited Oct 10 '14 at 18:16

answered Oct 08 '14 at 21:31


I am looking at pandas lib and it looks cool with tons of methods. Thanks for the nice and short code. – mane Oct 08 '14 at 22:14

Sure. Btw to reduce to only the columns you want just do df = df[['letters','numbers']] (or whatever they're called) after pd.read_csv – dranxo Oct 08 '14 at 22:16
I'm having trouble reading that code/data in the comment. Can you edit your question instead? – dranxo Oct 08 '14 at 23:45

Hi @dranxo, I used your code and changed the delimiter to ',' The file has


Addresses,Data
FromAP,42
FromAP,42
FromAP,33
ToAP,77
FromAP,54
FromAP,42
FromAP,42
FromAP,33
ToAP,42
FromAP,42
FromAP,33


I am seeing the following error:

– mane Oct 08 '14 at 23:47

/usr/lib/pymodules/python2.7/matplotlib/axes.py:8261: UserWarning: 2D hist input should be nsamples x nvariables;
 this looks transposed (shape is 0 x 1)
  'this looks transposed (shape is %d x %d)' % x.shape[::-1])
Traceback (most recent call last):
  File "./a.py", line 11, in <module>
    df_useful.hist()
  File "/usr/lib/python2.7/dist-packages/pandas/tools/plotting.py", line 2075, in hist_frame
    ax.hist(data[col].dropna().values, **kwds)
– mane Oct 08 '14 at 23:47
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 8312, in hist
    xmin = min(xmin, xi.min())
  File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 21, in _amin
    out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation minimum which has no identity
– mane Oct 08 '14 at 23:48
Sorry... I am having trouble formatting the code and the error is too long to put in one message below. – mane Oct 08 '14 at 23:49
Ok, well, I just ran your edited code on the data you just posted and it worked perfectly. Are you sure you are posting the correct dataset? You mentioned that you have many columns but the data you posted only has 2... – dranxo Oct 09 '14 at 00:09
I am still trying only on the 2-column dataset, but still seeing this error. Also, it shows
  File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 176, in histogram
    mn, mx = [mi+0.0 for mi in range]
TypeError: cannot concatenate 'str' and 'float' objects
– mane Oct 09 '14 at 00:15
It might be the dataset, I edited my answer to show how to use hist() without reading an external file. Does that code work on your machine? – dranxo Oct 10 '14 at 18:17
