
I am trying to merge two programs, or to write a third program that will call these two programs as functions. They are supposed to run one after the other, with an interval of a certain number of minutes in between, something like a makefile that will have a few more programs included later. I am not able to merge them, nor to put them into some format that will allow me to call them from a new main program.

program_master_id.py picks up the *.csv files from one folder location and, after the computation, appends the result to the master_ids.csv file in another folder.

Program_master_count.py divides the counts by the number of Ids in the respective time series.

Program 1: master_id.py

import pandas as pd
import numpy as np

# input csv files
# TODO: change this to read every *.csv in the Transition_Data folder,
# since that folder contains several CSV files

csv_file1 = 'Transition_Data/Test_1.csv'
csv_file2 = 'Transition_Data/Test_2.csv'

#master file to be appended only

master_csv_file = 'Data_repository/master_lac_Test.csv'

csv_file_all = [csv_file1, csv_file2]
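
# Alternative sketch (an assumption about the intended folder layout):
# pick up every *.csv under Transition_Data instead of listing the files by hand.
# import glob
# csv_file_all = sorted(glob.glob('Transition_Data/*.csv'))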

# read each csv into a DataFrame using a list comprehension

df_all = [pd.read_csv(csv_file) for csv_file in csv_file_all]

# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')

# custom function to handle/merge duplicates on Ids (axis=0):
# forward-fill within each duplicate group, then keep the last (most complete) row
def apply_func(group):
    return group.ffill().iloc[-1]

# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)

# do the subtraction

df_master = pd.read_csv(master_csv_file, index_col=['Ids']).sort_index()

# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)

# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)

print(df_matched)
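
# NOTE: this script only prints df_matched; if the master file is meant to be
# appended to (as described above), something like the following sketch could be
# added (mode='a' and header=False are assumptions about the desired format):
# df_matched.to_csv(master_csv_file, mode='a', header=False)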

Program 2: master_count.py (this gives no error, but it also produces no output)

import pandas as pd
import numpy as np

csv_file1 = 'Data_repository/master_lac_Test.csv'
csv_file2 = 'Data_repository/lat_lon_master.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

# df1 and df2 have a duplicated column 00:00:00, so use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])

# divide by the number of occurrences of each Id
# and add the 00:00:00 column
def my_func(group):
    num_obs = len(group)
    # process the columns from 00:30:00 onward (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)
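
# NOTE: `result` is only assigned above and never printed or written to disk,
# which is why this script appears to produce no output. For example:
# print(result)
# result.to_csv('Data_repository/master_count_result.csv')  # hypothetical output path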

I am trying to write a main program that will call master_id.py first and then master_count.py. Is there a way to merge both into one program, or to write them as functions and call those functions from a new program? Please suggest.

  • It would help to shorten this example a little bit so that it would be faster and easier to read... But I think you have 2 basic options: (1) enclose each in a function and then use `sleep()` from the `time` module to call the second function every 5 minutes, or (2) use a shell script (e.g. bash under Linux) to call each program separately and control how often to call it. That's probably the better way, but it will depend on your platform (Mac/Win/Unix) and choice of scripting language. – JohnE Jul 12 '15 at 14:34
  • @JohnE I will execute them on a Debian instance of AWS EC2. – Sitz Blogz Jul 12 '15 at 17:28
  • Well, this is not an area of expertise for me at all, so I can't help much. You might want to re-post this as a question with 'bash' and whatever AWS-related tags are appropriate. Although if you do that, I would edit the question to simplify the Python pieces as much as possible. – JohnE Jul 12 '15 at 17:50

1 Answer


Okay, let's say you have program1.py:

import pandas as pd
import numpy as np

def main_program1():
    csv_file1 = 'Transition_Data/Test_1.csv' 
    ...
    return df_matched

And then program2.py:

import pandas as pd
import numpy as np

def main_program2():
    csv_file1 = '/Data_repository/master_lac_Test.csv'
    ...
    result = temp.groupby(level='Ids').apply(my_func)
    return result

You can now use these in a separate Python program, say main.py:

import time
import program1 # imports program1.py
import program2 # imports program2.py

df_matched = program1.main_program1()
print(df_matched)
# wait
min_wait = 1
time.sleep(60*min_wait)
# call the second one
result = program2.main_program2()

There are lots of ways to 'improve' these, but hopefully this shows you the gist. I would in particular recommend using the `if __name__ == "__main__":` idiom in each of the files (see the Stack Overflow question "What does if __name__ == "__main__": do?"), so that they can easily be executed from the command line or imported and called from Python.
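
A minimal sketch of that idiom applied to program1.py (the function body is elided here, just as in the stub above):

import pandas as pd
import numpy as np

def main_program1():
    csv_file1 = 'Transition_Data/Test_1.csv'
    ...
    return df_matched

if __name__ == "__main__":
    # runs only when the file is executed directly (python program1.py),
    # not when it is imported by main.py
    print(main_program1())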

Another option is a shell script, which for your 'master_id.py' and 'master_count.py' becomes, in its simplest form (note that sleep takes seconds, so sleep 60 waits one minute):

python master_id.py
sleep 60
python master_count.py

Saved as 'main.sh', this can be executed with

sh main.sh