
I've looked at this response to try to get numpy to print the full array rather than a summarized view, but it doesn't seem to be working.

I have a CSV with named headers. Here are the first five rows:

v0  v1  v2  v3  v4
1001    5529    24  56663   16445
1002    4809    30.125  49853   28069
1003    407 20  28462   8491
1005    605 19.55   75423   4798
1007    1607    20.26   79076   12962

I'd like to read in the data and be able to view it fully. I tried doing this:

import numpy as np
np.set_printoptions(threshold=np.inf)

main_df2 = np.genfromtxt('file location', delimiter=',')
main_df2[0:3, :]

However this still returns the truncated array, and the performance seems greatly slowed. What am I doing wrong?

vashts85
    what does that last line show? That's only 3 rows and 5 columns, if the `genfromtxt` is right. – hpaulj Feb 24 '17 at 15:57

3 Answers


OK, in a regular Python session (I usually use IPython instead), I set the print options and made a large array:

>>> np.set_printoptions(threshold=np.inf, suppress=True)
>>> x = np.random.rand(25000, 5)

When I execute the next line, it spends about 21 seconds formatting the array, then writes the resulting string to the screen (with more lines than fit in the terminal's window buffer).

>>> x

This is the same as

>>> print(repr(x))

The internal storage for x is a buffer of floats (which you can 'see' with x.tostring()). To print x it has to be formatted: a multiline string containing a print representation of each number, all 125,000 of them. The result of repr(x) is a string about 1,850,000 characters long, spanning 25,000 lines. That is what takes 21 seconds. Displaying it on the screen is then limited only by the terminal's scroll speed.
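That timing can be sketched like this (a rough illustration; the exact seconds depend on the numpy version and the machine):

```python
import time
import numpy as np

np.set_printoptions(threshold=np.inf, suppress=True)
x = np.random.rand(25000, 5)

t0 = time.perf_counter()
s = repr(x)                   # build the full multiline string
elapsed = time.perf_counter() - t0

print(len(s))                 # on the order of a couple million characters
print(len(s.splitlines()))    # roughly one line per row
print(round(elapsed, 2))      # the formatting, not the I/O, is the slow part
```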

I haven't looked at the details, but I think the numpy formatting is mostly written in Python, not compiled. It's designed more for flexibility than speed. It's normal to want to see 10-100 lines of an array. 25000 lines is an unusual case.

Somewhat curiously, writing this array as a csv is fast, with a minimal delay:

>>> np.savetxt('test.txt', x, fmt='%10f', delimiter=',')

And I know what savetxt does: it iterates over the rows and does a file write for each:

f.write(fmt % tuple(row))

Evidently all the bells and whistles of the regular repr are expensive: it can summarize, it can handle many dimensions, it can handle complicated dtypes, etc. Simply formatting each row with a known fixed format is not the time-consuming step.

Actually that savetxt route might be more useful, as well as fast. You control the display format, and you can view the resulting text file in an editor or terminal window at your leisure, without being limited by the scroll buffer of your terminal window. But how will this savetxt file be different from the original csv?
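A minimal sketch of that workflow (the file name here is just an example):

```python
import numpy as np

x = np.random.rand(25000, 5)

# write the whole array with a fixed format; this is fast because each
# row becomes a single f.write(fmt % tuple(row)) under the hood
np.savetxt('test.txt', x, fmt='%10f', delimiter=',')

# then peek at any part of the file without formatting the full array
with open('test.txt') as f:
    for _ in range(3):
        print(f.readline().rstrip())
```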

hpaulj

I'm surprised you get an array at all, since your example does not use ',' as the delimiter. But maybe you forgot to include commas in your example file.

I would use the DataFrame functionality of pandas when working with csv data. It uses numpy under the hood, so all numpy operations work on pandas DataFrames.

Pandas has many tricks for operating with table like data.

import pandas as pd

df = pd.read_csv('nothing.txt')
#==============================================================================
# the next line removes blanks from the column names
#==============================================================================
df.columns = [name.strip(' ') for name in df.columns]

pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

print(df)
Henning

When I copied and pasted the data here it was open in Excel, but the file itself is a CSV.

I'm doing a class exercise and we have to use numpy. One thing I noticed was that the results were quite illegible thanks to the scientific notation, so I did the following and things are much smoother:

np.set_printoptions(threshold=100000, suppress=True)

The suppress option saved me a lot of formatting. The performance does suffer a lot when I change the threshold to something like np.nan or np.inf, and I'm not sure why.
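The effect of suppress is easy to see on a small array that mixes magnitudes (the values here are made up for illustration):

```python
import numpy as np

a = np.array([1001.0, 5529.0, 0.0000301])

np.set_printoptions(suppress=False)
print(a)   # scientific notation, because the magnitudes differ widely

np.set_printoptions(threshold=100000, suppress=True)
print(a)   # plain fixed-point notation, much easier to read
```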

vashts85
  • How big is this file? Pages and pages of rows? – hpaulj Feb 24 '17 at 15:35
  • 25,000 rows, so I wouldn't expect it to be slow in Python? Or is that typical in Python? My other programming experience is in R. – vashts85 Feb 24 '17 at 17:51
  • I can't imagine trying to print (write to the screen) 25000 rows of anything! I might pipe it to less/more and scroll through looking at selected rows. But the whole thing? – hpaulj Feb 24 '17 at 18:02
  • Sure, I can agree to that. I guess I should just slice a few rows? Is there a command to randomly select some rows? – vashts85 Feb 27 '17 at 13:28
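In answer to that last comment, one way to sample rows at random is np.random.choice on the row indices (a sketch, with a random array standing in for the loaded CSV data):

```python
import numpy as np

x = np.random.rand(25000, 5)   # stand-in for the array read from the CSV

# pick 10 distinct row indices at random, then slice with them
rows = np.random.choice(x.shape[0], size=10, replace=False)
print(x[rows])
```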