I have a lot of time series data: almost 3 GB of CSV files, about 50k columns by 6000 rows. I need to process them row by row. The rows are time-ordered, and for each row it is important that I look at every column.

Would importing this into pandas as a pivot table and iterating over it row by row be efficient? Any suggestions?

user1234440
  • Pandas is basically an array of columns. Because all columns are expected to have the same number of rows, defining a new column and calculating its value for each row from a formula involving other columns at that same row does not typically require the programmer to write an explicit row-by-row loop. So maybe you don't need to iterate row by row (see the sketch after these comments). Perhaps related: http://stackoverflow.com/questions/12376863/adding-calculated-columns-to-a-dataframe-in-pandas – Paul Apr 24 '15 at 04:38
  • Assuming your PC has sufficient RAM (I'd guess at least 8 GB if the CSV is 3 GB). Anyway, this is basic pandas stuff. Start by reading the file into pandas (just the first x rows if you don't have enough memory for the full file) and go from there. It's really not possible to answer such a general question as you are asking. Just start with the basics and ask specific questions as you run into specific problems. – JohnE Apr 24 '15 at 14:53
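
To illustrate Paul's point about avoiding explicit row iteration, here is a minimal sketch (the column names a, b and c are made up) of deriving a new column from existing ones in a single vectorized expression:

import pandas as pd

# Hypothetical frame standing in for two of the 50k columns.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# One vectorized expression computes the value for every row at once,
# so no explicit row-by-row loop is needed.
df['c'] = df['a'] * 2 + df['b']
print(df)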

2 Answers

It depends on your system's capabilities and on how you are processing the data: how do you process each row? What intermediate values are stored? How much history needs to be retained? Etc.

You can read the file into a DataFrame and then use iterrows, but it is not terribly efficient because:

  • a new pandas Series object has to be created for each row, and
  • dtypes are not preserved across rows (they are only preserved across columns for DataFrames), as the sketch below illustrates.
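
A small sketch of what iterrows hands back and why the dtype point matters (the column names are made up):

import pandas as pd

df = pd.DataFrame({'ints': [1, 2], 'floats': [0.5, 1.5]})
print(df.dtypes)  # ints: int64, floats: float64

for idx, row in df.iterrows():
    # Each row is a freshly built Series; the mixed column dtypes are
    # upcast to a common dtype (float64 here), so the int column loses its type.
    print(type(row), row.dtype)
    break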

In general, it is best to just read the entire table and then process it if your hardware is not a constraint.


import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6000, 50000))
>>> df.shape
(6000, 50000)

>>> df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6000 entries, 0 to 5999
Columns: 50000 entries, 0 to 49999
dtypes: float64(50000)
memory usage: 2.2 GB

df.to_csv(filename)

The documentation describes how to iterate through large files in chunks:

reader1 = pd.read_csv(filename, chunksize=1)
reader2 = pd.read_csv(filename, chunksize=10)
reader3 = pd.read_csv(filename, chunksize=100)
reader4 = pd.read_csv(filename, chunksize=1000)

# Chunksize = 1
%time for row in reader1:temp = row
CPU times: user 2h 11min 27s, sys: 1min 22s, total: 2h 12min 49s
Wall time: 2h 12min 39s

# Chunksize = 10
%time for row in reader2:temp = row
CPU times: user 14min 38s, sys: 11.9 s, total: 14min 50s
Wall time: 14min 50s    

# Chunksize = 100
%time for row in reader3:temp = row
CPU times: user 5min 17s, sys: 6.97 s, total: 5min 24s
Wall time: 5min 24s

# Chunksize = 1000
%time for row in reader4:temp = row
CPU times: user 4min 13s, sys: 6.8 s, total: 4min 20s
Wall time: 4min 20s

# Reading the whole file.
%time df2 = pd.read_csv(filename)
CPU times: user 4min 11s, sys: 8.4 s, total: 4min 19s
Wall time: 4min 19s
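
In a real run you would do the per-row work inside each chunk rather than just binding it to temp. A minimal sketch, reusing filename from above and with process_row as a hypothetical placeholder for your own logic:

import pandas as pd

def process_row(values):
    # Hypothetical placeholder for whatever per-row computation you need.
    return sum(values)

results = []
for chunk in pd.read_csv(filename, chunksize=1000):
    # itertuples is cheaper than iterrows because it does not build a Series per row.
    for row in chunk.itertuples(index=False):
        results.append(process_row(row))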
Alexander

Row by row? Pandas is not the ideal tool for this. I would suggest you look into Map/Reduce; it is designed for exactly this, and streaming is the key to row-by-row processing.
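For example, plain Python streaming (no framework at all) already processes one row at a time without ever loading the full file; a minimal sketch, where data.csv and process_row are hypothetical stand-ins for your file and your per-row logic:

import csv

def process_row(values):
    # Hypothetical placeholder for your per-row logic.
    return len(values)

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)   # skip the header row
    for values in reader:   # only one row is in memory at a time
        process_row(values)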

firelynx
  • Map/Reduce is an algorithm implemented in a very wide variety of tools, some of them extremely bloated for what most people need. For simple stuff in Python, I would recommend looking at MRJob: https://pythonhosted.org/mrjob/ If you want more data-pipelining integration, maybe pyspark https://spark.apache.org/docs/latest/programming-guide.html is more for you, but it requires a bit more setup. I hold workshops on how to build map/reduce implementations in bash, but the chance that you are in Berlin is quite slim, I guess. – firelynx Apr 28 '15 at 07:10