
I've got a .txt logfile with IMU sensor measurements which needs to be parsed to a .CSV file. The accelerometer and gyroscope have a 500 Hz ODR (output data rate), the magnetometer 100 Hz, GPS 1 Hz and the barometer 1 Hz. Wi-Fi, BLE, pressure, light, etc. are also logged, but most of that is not needed. The smartphone app doesn't save all measurements sequentially.

It takes 1000+ seconds to parse a file of 200k+ lines into a pandas DataFrame, sort the DataFrame on the timestamps, and save it as a CSV file.

When assigning sensor measurement values at a coordinate (row = timestamp, column = sensor measurement) in the DataFrame, some assignments need ~40% of the runtime, while others take only ~0.1%.

What could be the reason for this? It shouldn't take 1000+ seconds.


What is in the logfile:

ACCE;AppTimestamp(s);SensorTimestamp(s);Acc_X(m/s^2);Acc_Y(m/s^2);Acc_Z(m/s^2);Accuracy(integer)
GYRO;AppTimestamp(s);SensorTimestamp(s);Gyr_X(rad/s);Gyr_Y(rad/s);Gyr_Z(rad/s);Accuracy(integer)
MAGN;AppTimestamp(s);SensorTimestamp(s);Mag_X(uT);;Mag_Y(uT);Mag_Z(uT);Accuracy(integer)
PRES;AppTimestamp(s);SensorTimestamp(s);Pres(mbar);Accuracy(integer)
LIGH;AppTimestamp(s);SensorTimestamp(s);Light(lux);Accuracy(integer)
PROX;AppTimestamp(s);SensorTimestamp(s);prox(?);Accuracy(integer)
HUMI;AppTimestamp(s);SensorTimestamp(s);humi(Percentage);Accuracy(integer)
TEMP;AppTimestamp(s);SensorTimestamp(s);temp(Celsius);Accuracy(integer)
AHRS;AppTimestamp(s);SensorTimestamp(s);PitchX(deg);RollY(deg);YawZ(deg);RotVecX();RotVecY();RotVecZ();Accuracy(int)
GNSS;AppTimestamp(s);SensorTimeStamp(s);Latit(deg);Long(deg);Altitude(m);Bearing(deg);Accuracy(m);Speed(m/s);SatInView;SatInUse
WIFI;AppTimestamp(s);SensorTimeStamp(s);Name_SSID;MAC_BSSID;RSS(dBm);
BLUE;AppTimestamp(s);Name;MAC_Address;RSS(dBm);
BLE4;AppTimestamp(s);MajorID;MinorID;RSS(dBm);
SOUN;AppTimestamp(s);RMS;Pressure(Pa);SPL(dB);
RFID;AppTimestamp(s);ReaderNumber(int);TagID(int);RSS_A(dBm);RSS_B(dBm);
IMUX;AppTimestamp(s);SensorTimestamp(s);Counter;Acc_X(m/s^2);Acc_Y(m/s^2);Acc_Z(m/s^2);Gyr_X(rad/s);Gyr_Y(rad/s);Gyr_Z(rad/s);Mag_X(uT);;Mag_Y(uT);Mag_Z(uT);Roll(deg);Pitch(deg);Yaw(deg);Quat(1);Quat(2);Quat(3);Quat(4);Pressure(mbar);Temp(Celsius)
IMUL;AppTimestamp(s);SensorTimestamp(s);Counter;Acc_X(m/s^2);Acc_Y(m/s^2);Acc_Z(m/s^2);Gyr_X(rad/s);Gyr_Y(rad/s);Gyr_Z(rad/s);Mag_X(uT);;Mag_Y(uT);Mag_Z(uT);Roll(deg);Pitch(deg);Yaw(deg);Quat(1);Quat(2);Quat(3);Quat(4);Pressure(mbar);Temp(Celsius)
POSI;Timestamp(s);Counter;Latitude(degrees); Longitude(degrees);floor ID(0,1,2..4);Building ID(0,1,2..3)

A part of the RAW .txt logfile:

MAGN;1.249;343268.933;2.64000;-97.50000;-69.06000;0
GYRO;1.249;343268.934;0.02153;0.06943;0.09880;3
ACCE;1.249;343268.934;-0.24900;0.53871;9.59625;3
GNSS;1.250;1570711878.000;52.225976;5.174543;58.066;175.336;3.0;0.0;23;20
ACCE;1.253;343268.936;-0.26576;0.52674;9.58428;3
GYRO;1.253;343268.936;0.00809;0.06515;0.10002;3
ACCE;1.253;343268.938;-0.29450;0.49561;9.57710;3
GYRO;1.253;343268.938;0.00015;0.06088;0.10613;3
PRES;1.253;343268.929;1011.8713;3
GNSS;1.254;1570711878.000;52.225976;5.174543;58.066;175.336;3.0;0.0;23;20
ACCE;1.255;343268.940;-0.29450;0.49801;9.57710;3
GYRO;1.255;343268.940;-0.00596;0.05843;0.10979;3
ACCE;1.260;343268.942;-0.30647;0.50280;9.55795;3
GYRO;1.261;343268.942;-0.01818;0.05721;0.11529;3
MAGN;1.262;343268.943;2.94000;-97.74000;-68.88000;0

fileContent contains the lines of the .txt file as shown above.

Piece of the code:

def parseValues(line):
    # Parse the numeric fields after the 4-letter tag and its ';'
    valArr = np.fromstring(line[5:], dtype=float, sep=";")
    return valArr

i = 0
while i < len(fileContent):
    if (fileContent[i][:4] == "ACCE"):
        vals = parseValues(fileContent[i])
        idx = vals[1] - initialSensTS
        df.at[idx, 'ax'] = vals[2]
        df.at[idx, 'ay'] = vals[3]
        df.at[idx, 'az'] = vals[4]
        df.at[idx, 'accStat'] = vals[5]
        i += 1
    # ... similar branches for the other sensor types omitted here


The code works, but it's utterly slow at some of the df.at[idx, 'xx'] lines.
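A likely cause (my hypothesis, not confirmed in the question): when `idx` is a label that is not yet in the DataFrame's index, `df.at[idx, 'ax']` enlarges the DataFrame by one row, which can cost O(n) per insertion; the subsequent `.at` calls on the now-existing row are cheap, matching the 40% vs 0.1% split in the profiler. A much faster pattern is to collect plain Python rows first and construct the DataFrame once at the end. A minimal sketch, using a hypothetical two-line `file_content` and `initial_sens_ts`:

```python
import pandas as pd

# Hypothetical excerpt of ACCE lines from the log.
file_content = [
    "ACCE;1.249;343268.934;-0.24900;0.53871;9.59625;3",
    "ACCE;1.253;343268.936;-0.26576;0.52674;9.58428;3",
]
initial_sens_ts = 343268.934

rows = []  # collect plain dicts instead of writing cell-by-cell
for line in file_content:
    if line.startswith("ACCE"):
        vals = [float(v) for v in line.split(";")[1:]]
        rows.append({"ts": vals[1] - initial_sens_ts,
                     "ax": vals[2], "ay": vals[3],
                     "az": vals[4], "accStat": vals[5]})

# One DataFrame construction amortises all the per-row cost.
df = pd.DataFrame(rows).set_index("ts")
```

Building the index once up front (if all timestamps are known) and only then assigning would also avoid the repeated enlargement.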

See line 28 in the profiler output below.

Line profiler output:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
22         1          1.0      1.0      0.0      i = 0
23    232250     542594.0      2.3      0.0      while i < len(fileContent):
24    232249  294337000.0   1267.3     23.8          update_progress(i / len(fileContent))
25    232249     918442.0      4.0      0.1          if (fileContent[i][:4] == "ACCE"):
26     54602    1584625.0     29.0      0.1              vals = parseValues(fileContent[i])
27     54602     316968.0      5.8      0.0              idx = vals[1] - initialSensTS
28     54602  504189480.0   9233.9     40.8              df.at[idx, 'ax'] = vals[2]
29     54602    8311109.0    152.2      0.7              df.at[idx, 'ay'] = vals[3]
30     54602    4901983.0     89.8      0.4              df.at[idx, 'az'] = vals[4]
31     54602    4428239.0     81.1      0.4              df.at[idx, 'accStat'] = vals[5]
32     54602     132590.0      2.4      0.0              i += 1
    What is the issue with using the built in `I/O` tools of pandas to just read this file into a DataFrame from the start? – ALollz Oct 31 '19 at 20:18
  • pd.read_csv is giving `ParserError: Error tokenizing data. ` because the GNSS rows do not have the same amount of columns compared to ACCE, GYRO etc. – alehanderoo Oct 31 '19 at 20:55
  • How about something like [this](https://stackoverflow.com/a/55189021/3282436)? – 0x5453 Oct 31 '19 at 21:12
  • I'll second the suggestions in the link that @0x5453 shared, particularly this answer: https://stackoverflow.com/a/55189021/11301900. – AMC Nov 02 '19 at 03:30
  • A few notes: Variables should generally follow the `lower_case_with_underscores` style, not `camelCase`. I see no reason to use numpy to parse a string into an array, and then manually index that array to retrieve the elements. If I understand your `while` loop correctly, it only increments when `fileContent[i][:4] == "ACCE"`. If `fileContent[i][:4] != "ACCE"`, the counter does not change, which means that `fileContent[i][:4]` _still_ isn't equal to `"ACCE"`. The result should be an infinite loop, no? I don't understand why you aren't iterating over the file contents with a `for` loop. – AMC Nov 02 '19 at 03:49
  • I tried a few things out with your data, and I'm starting to suspect that pandas may not be the tool for the job. – AMC Nov 02 '19 at 03:50
  • Or not, sorry. I may have found something viable. I'm exhausted though, so I will only be able to post an answer tomorrow. – AMC Nov 02 '19 at 04:02
  • I hadn't noticed earlier, but some example output would be nice. – AMC Nov 03 '19 at 00:42
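Regarding the `ParserError` mentioned in the comments: one way to let `read_csv` tolerate rows with differing field counts is to pass more column names than any row has, so short rows are padded with NaN instead of raising. A sketch, assuming 25 covers the widest record type (IMUX/IMUL) and using an inline sample instead of the real file:

```python
import io
import pandas as pd

# A few raw lines with differing column counts (ACCE has 7 fields, GNSS has 11).
raw = """ACCE;1.253;343268.936;-0.26576;0.52674;9.58428;3
GNSS;1.254;1570711878.000;52.225976;5.174543;58.066;175.336;3.0;0.0;23;20
GYRO;1.255;343268.940;-0.00596;0.05843;0.10979;3
"""

# Supplying more column names than any row has makes read_csv pad
# shorter rows with NaN instead of raising ParserError.
df = pd.read_csv(io.StringIO(raw), sep=";", header=None, names=range(25))

# Split per sensor type afterwards, e.g. the accelerometer rows:
acce = df[df[0] == "ACCE"].iloc[:, 1:7]
```

The per-type sub-frames can then be renamed and converted to numeric dtypes individually.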

1 Answer


This doesn't address the part of your question about sorting timestamps etc, but should be an efficient replacement for your 'ACCE' parsing code.

import pandas as pd
import collections as colls

logs_file_path = '../resources/imu_logs_raw.txt'

msmt_type_dict = colls.defaultdict(list)

with open(logs_file_path, 'r') as file_1:
    for line in file_1:
        curr_measure_type, *rest_str = line.split(';')
        rest_str[-1] = rest_str[-1].strip()
        msmt_type_dict[curr_measure_type].append(rest_str)

acce_df = pd.DataFrame(data=msmt_type_dict['ACCE'], columns=['app_timestamp', 'sensor_timestamp', 'acc_x', 'acc_y', 'acc_z', 'accuracy'])

If you can provide some more information/context I would love to take a look at the timestamp sorting aspect.
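For the sorting aspect, one possible follow-up (a sketch under my own assumptions, not part of the answer above): `split()` yields strings, so the columns need a numeric cast before sorting on the sensor timestamp. Using a tiny hand-made sample in place of the parsed log:

```python
import pandas as pd
import collections as colls

# Hypothetical sample mirroring the answer's parsing output for ACCE lines,
# deliberately out of order like the raw log.
msmt_type_dict = colls.defaultdict(list)
for line in ["ACCE;1.253;343268.936;-0.26576;0.52674;9.58428;3",
             "ACCE;1.249;343268.934;-0.24900;0.53871;9.59625;3"]:
    curr_measure_type, *rest_str = line.split(';')
    msmt_type_dict[curr_measure_type].append(rest_str)

acce_df = pd.DataFrame(data=msmt_type_dict['ACCE'],
                       columns=['app_timestamp', 'sensor_timestamp',
                                'acc_x', 'acc_y', 'acc_z', 'accuracy'])

# Cast the string fields to float, then sort on the sensor timestamp.
acce_df = acce_df.astype(float).sort_values('sensor_timestamp',
                                            ignore_index=True)
```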

AMC