
I have climate sensor data generated by a sensor hub and periodically appended to a text file, one JSON object per line:

{"Time":1541203508.45,"Tc":25.4,"Hp":33}
{"Time":1541203508.45,"Tc":25.2,"Hp":32}
{"Time":1541203508.45,"Tc":25.1,"Hp":31}
{"Time":1541203508.45,"Tc":25.2,"Hp":33}

I'm doing a lot of list lookups (binning) in a for loop like this:

import random

#generate sample data (randint needs integer bounds, so scale by 1000 afterwards)
sensor_data = {x: {"Time": 15e6 + x,
                   "Tc": random.randint(20000, 50000) / 1000.0,
                   "Hp": random.randint(0, 100000) / 1000.0}
               for x in range(int(1e5))}

#process data
for i in sensor_data:
    #we'd normally do a json.loads(data_line) if reading from the file
    sensor_data[i]['H'] = ['V_Dry', 'Dry', 'Normal', 'Humid', 'V_Humid', 'ERR'][int(sensor_data[i]['Hp'] // 20)]
    sensor_data[i]['T'] = ['V_Cold', 'Cold', 'Normal', 'Hot', 'V_Hot', 'ERR'][int(sensor_data[i]['Tc'] // 10)]
    #... and so on for other sensors etc.

input:

{0: {'Hp': 20.514, 'Tc': 43.92, 'Time': 15000000.0},
 1: {'Hp': 59.332, 'Tc': 35.592, 'Time': 15000001.0},
 2: {'Hp': 19.49, 'Tc': 25.813, 'Time': 15000002.0},
 3: {'Hp': 78.644, 'Tc': 48.07, 'Time': 15000003.0},
 4: {'Hp': 3.967, 'Tc': 35.058, 'Time': 15000004.0}}

output:

{0: {'H': 'Dry', 'Hp': 20.514, 'T': 'V_Hot', 'Tc': 43.92, 'Time': 15000000.0},
 1: {'H': 'Normal', 'Hp': 59.332, 'T': 'Hot', 'Tc': 35.592, 'Time': 15000001.0},
 2: {'H': 'V_Dry', 'Hp': 19.49, 'T': 'Normal', 'Tc': 25.813, 'Time': 15000002.0},
 3: {'H': 'Humid', 'Hp': 78.644, 'T': 'V_Hot', 'Tc': 48.07, 'Time': 15000003.0},
 4: {'H': 'V_Dry', 'Hp': 3.967, 'T': 'Hot', 'Tc': 35.058, 'Time': 15000004.0}}

How can I speed up this translation? We usually have 100 MB-1 GB of this data per file to read and process.

I've tried a few things already:

I haven't multiprocessed reading the file itself, because I assumed that would cause too many disk fetches, though I now believe that assumption is incorrect.

I've tried multiprocessing chunks of the data lines, using a function that splits the text file into chunks, and the multiprocessing overhead seemed to be too much.
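
A sketch of what I mean by chunked multiprocessing (not my exact code; the chunk size and file name are illustrative):

import multiprocessing as mp
import ujson

H_LABELS = ['V_Dry', 'Dry', 'Normal', 'Humid', 'V_Humid', 'ERR']
T_LABELS = ['V_Cold', 'Cold', 'Normal', 'Hot', 'V_Hot', 'ERR']

def bin_chunk(lines):
    """Parse and bin a list of raw JSON lines."""
    out = []
    for line in lines:
        rec = ujson.loads(line)
        rec['H'] = H_LABELS[int(rec['Hp'] // 20)]
        rec['T'] = T_LABELS[int(rec['Tc'] // 10)]
        out.append(rec)
    return out

if __name__ == '__main__':
    with open('sensor.log') as f:
        lines = f.readlines()
    chunks = [lines[i:i + 10000] for i in range(0, len(lines), 10000)]
    with mp.Pool(4) as pool:
        results = pool.map(bin_chunk, chunks)

The overhead here is mostly pickling the chunks out to the workers and the results back, which can easily swamp the cheap per-record work.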

I've also tried four processes all consuming one shared queue, and four processes each consuming its own queue, with a manager.list object passed to the processes.
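
A sketch of the shared-queue variant (again illustrative, not my exact code; the binning itself is omitted for brevity):

import multiprocessing as mp
import ujson

def worker(in_q, out_q):
    """Consume batches of raw lines until a None sentinel arrives.
    The same per-record binning as in the sketch above would go here."""
    while True:
        batch = in_q.get()
        if batch is None:
            break
        out_q.put([ujson.loads(line) for line in batch])

if __name__ == '__main__':
    in_q, out_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(4)]
    for p in procs:
        p.start()
    n_batches = 0
    with open('sensor.log') as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == 10000:
                in_q.put(batch)
                n_batches += 1
                batch = []
        if batch:
            in_q.put(batch)
            n_batches += 1
    for _ in procs:
        in_q.put(None)                                 # one sentinel per worker
    results = [out_q.get() for _ in range(n_batches)]  # drain before joining
    for p in procs:
        p.join()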

In each case, just running the simple for loop over the data timed faster overall. What can I do, with multiprocessing, threading, or anything else, to beat the serial processing?

The data does not have to stay in any order, because we have timestamps we can sort on later. We have a 4-core i7 at our disposal. We use ujson, which seems to be faster than the standard json module.

azazelspeaks

1 Answer


If you are doing an I/O-bound operation you should use multithreading, not multiprocessing; if you have a CPU-bound operation you should use multiprocessing. Since your operation is reading data from a file, you should use multithreading instead of multiprocessing. For more information check this and this answer.
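
A minimal sketch of that idea, assuming a reader thread that feeds batches of raw lines into a bounded queue while the main thread parses them (the names and batch size are illustrative):

import threading
import queue
import ujson

def reader(path, q, batch_size=10000):
    """Read raw lines in a background thread so disk I/O overlaps parsing."""
    with open(path) as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                q.put(batch)
                batch = []
        if batch:
            q.put(batch)
    q.put(None)  # sentinel: end of file

def process(path='sensor.log'):
    q = queue.Queue(maxsize=8)  # bounded, so the reader can't run far ahead
    t = threading.Thread(target=reader, args=(path, q))
    t.start()
    records = []
    while True:
        batch = q.get()
        if batch is None:
            break
        records.extend(ujson.loads(line) for line in batch)
    t.join()
    return records

Note that under CPython's GIL only the disk I/O overlaps here; the ujson parsing itself still runs on one core.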

Saber Solooki