
I can plot a CDF and CCDF when the data is in a single column, but I am not sure how to plot a CDF or CCDF when the data is in the format shown below. The pairs in round brackets () are node pairs. The values in square brackets [] are the occurrence values, and the number in between (e.g. 7) is the frequency. We don't have to consider the frequency, only the occurrence values.

Input data format. There are millions of rows, each with many values between the square brackets ([]):

('4503', '656') 7 [2473.0, 35.0, 235.0, 157.0, 505.0, 45.0, 1303.0] 
('2105', '674') 1 [2584.0] 
('5139', '1086') 1 [1488.0] 
('3690', '2034') 6 [1009.0, 1108.0, 132.0, 447.0, 157.0, 466.0] 
('3867', '1982') 1 [1134.0] 

I have to plot the CCDF of all the data between the square brackets ([]) together, not separately for each row. I do not understand how to read the data between the square brackets and plot it.

2 Answers


Your problem is mainly getting the input data into the right format:

  1. Parse the input file for the data you need, namely the values between the square brackets. This can be done with regular expressions using the re module from the Python standard library. Write them all, space-delimited, to a text file.

  2. Load all these values into a NumPy array and plot them as described here: Read file and plot CDF in Python
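The parsing step above could be sketched as follows. This is only a minimal illustration, assuming every line contains at most one `[...]` group; the names `BRACKETED` and `extract_values` are hypothetical and not part of the original answer. Matching only the text between the square brackets means the node pairs in round brackets are never picked up.

```python
import re

# Matches the contents of the first [...] group on a line.
# Assumption: each input line has at most one such group.
BRACKETED = re.compile(r"\[([^\]]*)\]")


def extract_values(line):
    """Return the floats found inside the square brackets of one line."""
    match = BRACKETED.search(line)
    if match is None:
        return []  # no bracketed list on this line
    # Split on commas and skip empty tokens (e.g. from "[]").
    return [float(tok) for tok in match.group(1).split(",") if tok.strip()]


sample = "('4503', '656') 7 [2473.0, 35.0, 235.0, 157.0, 505.0, 45.0, 1303.0]"
values = extract_values(sample)
```

Calling `extract_values` on every line and concatenating the results gives one flat list of occurrence values.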

barrios
  • If I use the `re` module to parse the input, the pairs from the round brackets and the values from the square brackets get mixed up, and it is really hard to tell which is which. –  Oct 15 '14 at 18:28

You can do this by finding the indices of [ and ] in each line, slicing out that part, parsing it into a list with ast.literal_eval, and appending it to the main list.

import ast
import numpy as np
from pylab import *

file_data = """('4503', '656') 7 [2473.0, 35.0, 235.0, 157.0, 505.0, 45.0, 1303.0] 
('2105', '674') 1 [2584.0] 
('5139', '1086') 1 [1488.0] 
('3690', '2034') 6 [1009.0, 1108.0, 132.0, 447.0, 157.0, 466.0] 
('3867', '1982') 1 [1134.0] """

data = []

for line in file_data.splitlines():
    data += ast.literal_eval(line[line.find('['):line.find(']')+1])

Once you have done the above, you should be able to plot the CDF as follows:

# Convert data to a sorted numpy array (defined first, because X needs len(Y))
Y  = np.array(sorted(data))

# Building an array of len(Y) uniform x points ranging from 0 up to (but excluding) max(data)
X  = np.linspace(0, max(data), len(Y), endpoint=False)

# Normalizing data so that it sums to 1 (treated here as a PDF vector)
Y /= Y.sum()

# CDF can be obtained by the `np.cumsum` method:
Yc = np.cumsum(Y)

# Plot Y vs X
plot(X,Y,color="green" )

# Plot the CDF
plot(X,Yc,color="red"   )

# Display the plot
show()

The following is obtained for the above data:

[Image: plot produced by the code above, with Y in green and the cumulative sum Yc in red]
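Since the question asks for the CCDF, here is a hedged sketch of the usual empirical definition, which differs from the normalization used above: sort the values, assign the i-th smallest value the cumulative probability i/n, and take the CCDF as 1 minus that. The function names `empirical_cdf` and `plot_ccdf` are hypothetical, and `matplotlib.pyplot` is assumed to be available for the plotting part only.

```python
import numpy as np


def empirical_cdf(values):
    """Return (x, cdf, ccdf) for a 1-D collection of numbers.

    x    : the values sorted in ascending order
    cdf  : fraction of samples <= x[i], counted by rank (i + 1) / n
    ccdf : 1 - cdf, i.e. fraction of samples ranked above x[i]
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    cdf = np.arange(1, n + 1) / n
    ccdf = 1.0 - cdf
    return x, cdf, ccdf


def plot_ccdf(values):
    """Draw the empirical CCDF as a step plot (assumes matplotlib is installed)."""
    import matplotlib.pyplot as plt

    x, _, ccdf = empirical_cdf(values)
    plt.step(x, ccdf, where="post")
    plt.xlabel("occurrence value")
    plt.ylabel("P(X > x)")
    plt.show()


# Small made-up example, only to show the shape of the output.
x, cdf, ccdf = empirical_cdf([3.0, 1.0, 2.0, 2.0])
```

With the `data` list built earlier, `plot_ccdf(data)` would draw the CCDF of all bracketed values together. For heavy-tailed data it may help to switch both axes to a log scale, in which case the final point (where the CCDF is 0) cannot be shown.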

Raghav RV
  • This looks like it should work. I don't fully understand it yet, but I will check it with the full data and reply back. Thanks! –  Oct 15 '14 at 20:57
  • I changed this to `file_data = np.loadtxt('Input_File')`, but I get this error: `Traceback (most recent call last): File "new_line.py", line 5, in file_data = np.loadtxt('Input_File') File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 796, in loadtxt items = [conv(val) for (conv, val) in zip(converters, vals)] ValueError: could not convert string to float: ('785',` –  Oct 16 '14 at 09:22
  • Load the entire data file as a string into the `file_data` variable: use `f = open('Input_File'); file_data = f.read(); f.close()` and proceed as described above. See if that helps. – Raghav RV Oct 16 '14 at 18:08
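With millions of rows, an alternative to reading the whole file into one string is to iterate over the lines as they are read. The sketch below keeps the same `ast.literal_eval` slicing approach as the answer; the helper name `collect_values` is hypothetical, and it is assumed that lines without a bracketed list (such as blank lines) can simply be skipped.

```python
import ast


def collect_values(lines):
    """Gather the bracketed values from an iterable of lines into one flat list.

    `lines` can be any iterable of strings, including an open file object,
    so the file does not have to be loaded as a single string.
    """
    data = []
    for line in lines:
        start, end = line.find("["), line.find("]")
        if start == -1 or end == -1:
            continue  # assumption: lines without [...] carry no values
        data.extend(ast.literal_eval(line[start:end + 1]))
    return data


sample_lines = [
    "('2105', '674') 1 [2584.0] \n",
    "\n",
    "('3690', '2034') 6 [1009.0, 1108.0, 132.0, 447.0, 157.0, 466.0] \n",
]
data = collect_values(sample_lines)
```

For the real input this could be used as `with open('Input_File') as f: data = collect_values(f)`, after which `data` can be plotted as in the answer.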