
Recently I learned that if we want to manipulate the data in a CSV file from Excel, we need to transform it first into an ndarray with NumPy (please correct me if what I just learned is wrong).

Around the same time, I also learned how to make a plot with matplotlib. I saw some simple code somewhere that displays a plot with matplotlib, and the writer didn't transform the data into an ndarray; they simply plotted it using row[0] and row[1].

Why didn't they transform it into a NumPy ndarray first? And how can I tell when I should turn a CSV file into an ndarray?

Matt Hall
random student

2 Answers


It's really hard to say what this other person was doing to make their plot without seeing their code, but probably the data was already in memory as a Python object. You can only make a plot in matplotlib using data that you have in memory, e.g. from a Python list, or from a NumPy array, or maybe from a Pandas DataFrame, or some other object.

As you probably know, CSV is a file format. It's not a Python or NumPy object. In order to make a plot from the data, you must use some kind of file-reading code to read the file into memory. Then you can do things with it in Python.
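One plausible reading of the row[0] and row[1] code mentioned in the question is a loop over the standard library's csv.reader, which yields each row as a list of strings. Here is a minimal sketch of that pattern, assuming a two-column file with a header row. The sample contents are made up, and io.StringIO stands in for an opened file:

```python
import csv
import io

# Hypothetical stand-in for open("mydata.csv", newline=""); the contents are made up.
csv_file = io.StringIO("x,y\n1,2.5\n2,3.5\n3,5.0\n")

x, y = [], []
reader = csv.reader(csv_file)
next(reader)  # skip the header row
for row in reader:
    # csv.reader yields strings, so convert before plotting or doing arithmetic
    x.append(float(row[0]))
    y.append(float(row[1]))

print(x)  # [1.0, 2.0, 3.0]
print(y)  # [2.5, 3.5, 5.0]
```

The resulting plain lists can be handed straight to matplotlib, for example with plt.plot(x, y), which is why no ndarray is needed just to draw the plot.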

People read files in all sorts of different ways, depending on their ultimate goal. One option is NumPy's genfromtxt() function, as mentioned by a commenter and as described in this Stack Overflow question. So you might do this:

import numpy as np

data = np.genfromtxt("mydata.csv", delimiter=',')
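If the file has a header row, genfromtxt will turn those non-numeric cells into nan unless it is told to skip them. A small sketch of that, assuming a made-up two-column file; io.StringIO stands in for the file path so the snippet runs on its own:

```python
import io
import numpy as np

# Hypothetical file contents; in practice you would pass a path like "mydata.csv".
csv_text = io.StringIO("x,y\n1,2.5\n2,3.5\n3,5.0\n")

# skip_header=1 drops the "x,y" line so only numeric rows are parsed
data = np.genfromtxt(csv_text, delimiter=",", skip_header=1)
print(data.shape)  # (3, 2)

# Columns come out as 1-D float arrays, ready for arithmetic or plotting
x = data[:, 0]
y = data[:, 1]
print(y.mean())
```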

A note about pandas

A lot of people really like pandas for handling data from CSVs. This is because a CSV can have all sorts of different data in it. For example, it might have a column of strings, a column of floats, a column of dates, etc. NumPy is great for datasets in which every element is of the same type (e.g. all floats representing the same thing, like measurements of temperature on a surface). But it's not ideal for datasets that mix lots of different kinds of measurement. That's what pandas is for. pandas is also great for reading and writing CSV and even XLS files.
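To illustrate the mixed-type case, here is a small sketch with a made-up file holding a string column, a float column and a date column. The column names and values are hypothetical:

```python
import io
import pandas as pd

# Hypothetical file contents; in practice you would pass a path to read_csv.
csv_text = io.StringIO(
    "station,temperature,date\n"
    "A,12.5,2020-01-01\n"
    "B,14.0,2020-01-02\n"
    "A,11.0,2020-01-03\n"
)

# Each column keeps its own type; parse_dates converts the date strings
df = pd.read_csv(csv_text, parse_dates=["date"])
print(df.dtypes)

# Filter rows by a condition on one column
warm = df[df["temperature"] > 12]

# A single numeric column can still be pulled out as a NumPy array
temps = df["temperature"].to_numpy()
```

A DataFrame column can usually be passed to matplotlib directly, so converting to an ndarray is only needed when some other code requires one.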

Matt Hall

Your data does not have to be an ndarray in order to plot it with matplotlib. You can read in your data as a list and it will plot all the same, as kwinkunks also mentioned. How you read in your data matters, though, and that is the step you really need to worry about first!

To answer your question: if you really want to manipulate data and not just plot it, then using a NumPy array is the way to go. The advantage of NumPy arrays is that you can easily compute new variables and condition (filter) the data you have.

Take the following example. On the left, the data is plotted from plain lists, which cannot be subset directly with a condition. On the right, because the same data is also stored in NumPy arrays, you can easily condition it, say by taking only the points with x values greater than 3 and plotting them in red.
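As a quick sketch of what that means outside of plotting (the values here are made up), arithmetic and comparisons apply to a whole array at once, while the same comparison on a plain list raises an error:

```python
import numpy as np

x = [2, 5, 4, 3, 6]
x_array = np.array(x)

# Arithmetic on the whole array at once: a new derived variable
x_squared = x_array ** 2

# Boolean mask: True where the condition holds
mask = x_array > 3
print(x_array[mask])  # [5 4 6]

# The same comparison on a plain list raises TypeError
try:
    x > 3
except TypeError as err:
    print("list comparison failed:", err)
```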

import matplotlib.pyplot as plt
import numpy as np

#Declare some data as a list
x = [2,5,4,3,6,2,6,10,1,0,.5]
y = [7,2,8,1,4,5,6,5,4,5,2]

#Make that same data a numpy array
x_array = np.array([2,5,4,3,6,2,6,10,1,0,.5])
y_array = np.array([7,2,8,1,4,5,6,5,4,5,2])

#Declare a figure with 2 side-by-side subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

#Plot only the list
ax1.scatter(x,y) 

#Plot only the list again on the second subplot
ax2.scatter(x,y) 

#Build the boolean mask once, then plot the selected points in red
mask = x_array > 3
ax2.scatter(x_array[mask], y_array[mask], c='red')

plt.show()


BenT