
I'm trying to plot a histogram for a large dataset (nearly 7 million points) in Python, where I want to know the frequency of the values. I have tried the code below, but it takes too long to finish (more than an hour). So, are there any suggestions?

import numpy as np
import matplotlib.pyplot as plt

file_path = "D:/results/planarity2.txt" 
data_array = []

with open(file_path, "r") as file:
    for line in file:
        value = line.strip()  
        data_array.append(value)
column_values = data_array 

unique_values, counts = np.unique(column_values, return_counts=True)

value_frequency = dict(zip(unique_values, counts))


x_values = list(value_frequency.keys())
y_values = list(value_frequency.values())

plt.bar(x_values, y_values, edgecolor='black', alpha=0.7)


plt.xlabel('Column Values')
plt.ylabel('Frequency')
plt.title('Frequency of Points Based on Column Values')
plt.show()

I also tried this, but it was no use either:

import numpy as np
import matplotlib.pyplot as plt

file_path = "D:/results/planarity2.txt" 
data_array = []

with open(file_path, "r") as file:
    for line in file:
        value = line.strip()  
        data_array.append(value)
column_values = data_array 
value_frequency = {}

for value in column_values:
    if value in value_frequency:
        value_frequency[value] += 1
    else:
        value_frequency[value] = 1

x_values = list(value_frequency.keys())
y_values = list(value_frequency.values())

plt.bar(x_values, y_values, edgecolor='black', alpha=0.7)

plt.xlabel('Column Values')
plt.ylabel('Frequency')
plt.title('Frequency of Points Based on Column Values')
plt.show()
  • Are the values in your input file just numbers or are they strings of text? If they are strings of text, what maximum length are they? – Matt Pitkin Sep 01 '23 at 15:40

1 Answer


I think your main issue is that you are reading in the file and leaving the values as strings, rather than converting them to numbers and holding them in a NumPy array (assuming your values are just numbers?). Having 7 million data points should not be a particular problem. One thing to try first is to read the file with the NumPy loadtxt function, which will automatically convert the values to floating-point numbers as they are read in and return a NumPy array. E.g., rather than:

file_path = "D:/results/planarity2.txt" 
data_array = []

with open(file_path, "r") as file:
    for line in file:
        value = line.strip()  
        data_array.append(value)
column_values = data_array

just have:

file_path = "D:/results/planarity2.txt"
column_values = np.loadtxt(file_path)

See if that helps.
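
If what you want is a binned histogram rather than one bar per distinct value, it is also much faster to let NumPy do the binning before plotting. Here is a minimal sketch along those lines, assuming the values really are plain floats; the choice of 100 bins is arbitrary and not from your post:

import numpy as np
import matplotlib.pyplot as plt

file_path = "D:/results/planarity2.txt"

# Read the whole file into a float NumPy array in one call
column_values = np.loadtxt(file_path)

# Bin the ~7 million values into 100 bins instead of counting every unique value
counts, bin_edges = np.histogram(column_values, bins=100)

# Plot the pre-computed bin counts as a bar chart
plt.bar(bin_edges[:-1], counts, width=np.diff(bin_edges),
        align='edge', edgecolor='black', alpha=0.7)

plt.xlabel('Column Values')
plt.ylabel('Frequency')
plt.title('Frequency of Points Based on Column Values')
plt.show()

Alternatively, plt.hist(column_values, bins=100) does the binning and plotting in a single call.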
