
The variable `data` in the code below contains hundreds of execution results from querying a database. Each execution result is one day of data, roughly 7,000 rows (the columns are timestamp and value). I append the days to one another, producing several million rows; these hundreds of appends take a long time. Once I have the complete data set for one sensor, I store it as a column in the `unitdf` DataFrame, then repeat the process for each remaining sensor and merge each result into `unitdf`.

I believe both the append and the merge are costly operations. The only possible solution I have found is to split each column into lists and, once all the data has been added to the lists, bring the columns together into a DataFrame. Any suggestions to speed things up?

import pandas as pd
from cassandra.concurrent import execute_concurrent

i = 0
for sensor_id in sensors:  # loop through each of the 20 sensors
    # Prepared statement to query Cassandra
    session_data = session.prepare("select timestamp, value from measurements_by_sensor where unit_id = ? and sensor_id = ? and date = ? ORDER BY timestamp ASC")
    # Execute the prepared statement concurrently over a range of dates
    data = execute_concurrent(session, ((session_data, (unit_id, sensor_id, date)) for date in dates), concurrency=150, raise_on_first_error=False)

    sensordf = pd.DataFrame()
    # Loop through the execution results and append every successful execution that contains data
    for (success, result) in data:
        if success:
            sensordf = sensordf.append(pd.DataFrame(result.current_rows))

    # Rename the value column after the sensor and parse the timestamps
    sensordf.rename(columns={'value': sensor_id}, inplace=True)
    sensordf['timestamp'] = pd.to_datetime(sensordf['timestamp'], format="%Y-%m-%d %H:%M:%S", errors='coerce')

    # Merge each sensor's column into the unit-wide DataFrame
    if i == 0:
        i += 1
        unitdf = sensordf
    else:
        unitdf = unitdf.merge(sensordf, how='outer')
asked by red79phoenix
    "Pandas DataFrame append slow when appending hundreds of DataFrames with thousands of rows each".....well yea – SuperStew Sep 19 '18 at 19:34
  • 3
    Use `pd.concat` That is what it is for. Gather them up in a list or dictionary then `pd.concat(gathered_up_dataframes)`. There are many dups to this question. – piRSquared Sep 19 '18 at 19:41
  • Thank you @piRSquared, sorry for the duplicate question. I was thinking concat was only used for two dataframes not lists or dictionaries of them. – red79phoenix Sep 19 '18 at 19:48
  • Don't worry about the duplicate, it doesn't hurt me (-: I mention it mostly for others who may want to look it up and I could hammer it. [See my post on concat](https://stackoverflow.com/a/49620539/2336654) – piRSquared Sep 19 '18 at 19:53
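
A minimal sketch of the gather-then-concat pattern piRSquared describes, reusing the names from the question (`sensors`, `session`, `session_data`, `unit_id`, `dates`); the timestamp-indexed `pd.concat(..., axis=1)` join at the end is an extra assumption of mine, replacing the repeated outer merges and requiring timestamps to be unique within each sensor:

import pandas as pd
from cassandra.concurrent import execute_concurrent

sensor_columns = []
for sensor_id in sensors:
    data = execute_concurrent(session, ((session_data, (unit_id, sensor_id, date)) for date in dates), concurrency=150, raise_on_first_error=False)

    # Gather each day's rows in a list and concatenate once,
    # instead of appending inside the loop
    daily = [pd.DataFrame(result.current_rows) for success, result in data if success]
    sensordf = pd.concat(daily, ignore_index=True)

    sensordf['timestamp'] = pd.to_datetime(sensordf['timestamp'], format="%Y-%m-%d %H:%M:%S", errors='coerce')
    # Keep one timestamp-indexed column per sensor, named after the sensor
    sensor_columns.append(sensordf.set_index('timestamp')['value'].rename(sensor_id))

# Align all 20 sensor columns on timestamp in a single outer join
unitdf = pd.concat(sensor_columns, axis=1).reset_index()

Each `DataFrame.append` call re-copies every row accumulated so far, so hundreds of appends per sensor is quadratic in the number of rows; a single `pd.concat` copies each row once. The same reasoning applies to the 19 pairwise outer merges.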

0 Answers