I have a folder (user) that contains 80 subfolders (1, 2, 3, ..., 80), and each subfolder contains two text files (file1 and file2). file1 has 7 columns and file2 has 3 columns; neither file has column labels, and they are not the same length. The first column of file1 is time, and the first and second columns of file2 are start_time and end_time. What I want to do is:
- read file1 and file2 in each subfolder and check whether each time in the first column of file1 falls within any (start_time, end_time) range in file2;
- if yes, put the value from the third column of file2 into a new column of file1 on the row whose time is in that range, and convert the result to a pandas DataFrame;
- if no, i.e. the time is not in any (start_time, end_time) range, put NaN in the new column;
- do this for the two text files in every folder (1, 2, 3, ..., 80), so that at the end there are 80 DataFrames;
- concatenate all the DataFrames into one single pandas DataFrame.
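In pandas terms, the per-file matching step could be sketched like this on toy numbers (the column names and values are invented for illustration; pd.IntervalIndex is one way to do the range lookup):

```python
import numpy as np
import pandas as pd

# toy stand-ins for file1 / file2 (values invented for illustration)
file1 = pd.DataFrame({'time': [5, 15, 25, 40]})
file2 = pd.DataFrame({'start_time': [0, 20],
                      'end_time':   [10, 30],
                      'label':      [7, 4]})

# build an IntervalIndex from the (start_time, end_time) ranges;
# closed='left' means start_time <= time < end_time counts as a match
intervals = pd.IntervalIndex.from_arrays(file2['start_time'],
                                         file2['end_time'],
                                         closed='left')

# get_indexer returns the matching interval position, or -1 for no match
idx = intervals.get_indexer(file1['time'])
file1['label'] = np.where(idx >= 0, file2['label'].values[idx.clip(0)], np.nan)
# file1['label'] is now [7.0, NaN, 4.0, NaN]
```

Note that get_indexer requires the intervals to be non-overlapping.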
file1 is something like this:
1493974279325 1251148166327 417683620715 250.0 50.847060192 -0.134362797 66.307039766
1493974280326 1252150237681 417683620715 350.0 50.847057006 -0.134359581 105.778622992
1493984243830 1253153644973 417683620715 350.0 50.847054933 -0.134318363 158.247842792
1493984243840 1254156207993 417683620715 350.0 50.847051726 -0.134282482 158.247842792
1493974283335 1255160442889 417683620715 350.0 50.847050123 -0.134264542 158.247842792
1493974284338 1256162859035 417683620715 350.0 50.847049321 -0.134255572 158.247842792
1493974285340 1257165017889 417683620715 350.0 50.847048921 -0.134251086 158.247842792
1493974286343 1258168318930 417683620715 350.0 50.84704872 -0.134248844 158.247842792
1493974287347 1259171307992 417683620715 350.0 50.84704862 -0.134247723 158.247842792
1493974288351 1260175022576 417683620715 350.0 50.84704857 -0.134247162 158.247842792
1493974289352 1261177816325 417683620715 350.0 50.847048545 -0.134246882 158.247842792
1493984243890 1262179719971 417683620715 350.0 50.847048532 -0.134246741 158.247842792
1493984243900 1263182887158 417683620715 350.0 50.847048526 -0.134246671 158.247842792
file2 is something like this:
1488377142416 1488378192537 7
1488379212697 1488380967936 3
1488381613217 1488382948229 4
1488383667626 1488384747965 4
1488385047398 1488385633069 5
1488386203182 1488386877333 5
1488386952452 1488388272444 4
1488388482553 1488389532601 3
1488389863114 1488391843248 4
The expected output would be something like this (what I want is a pandas DataFrame). In this output the last column is the newly created column, and its values come from the third column of file2. (This is just an example to show the expected output; the times in the first column are not actually within the start_time/end_time ranges above.)
1493974279325 1251148166327 417683620715 250.0 50.847060192 -0.134362797 66.307039766 7
1493974280326 1252150237681 417683620715 350.0 50.847057006 -0.134359581 105.778622992 7
1493984243830 1253153644973 417683620715 350.0 50.847054933 -0.134318363 158.247842792 7
1493984243840 1254156207993 417683620715 350.0 50.847051726 -0.134282482 158.247842792 4
1493974283335 1255160442889 417683620715 350.0 50.847050123 -0.134264542 158.247842792 4
1493974284338 1256162859035 417683620715 350.0 50.847049321 -0.134255572 158.247842792 4
1493974285340 1257165017889 417683620715 350.0 50.847048921 -0.134251086 158.247842792 4
1493974286343 1258168318930 417683620715 350.0 50.84704872 -0.134248844 158.247842792 4
1493974287347 1259171307992 417683620715 350.0 50.84704862 -0.134247723 158.247842792 4
1493974288351 1260175022576 417683620715 350.0 50.84704857 -0.134247162 158.247842792 3
1493974289352 1261177816325 417683620715 350.0 50.847048545 -0.134246882 158.247842792 3
1493984243890 1262179719971 417683620715 350.0 50.847048532 -0.134246741 158.247842792 NAN
1493984243900 1263182887158 417683620715 350.0 50.847048526 -0.134246671 158.247842792 NAN
What I have done:
#1
import glob
import os

import numpy as np
import pandas as pd

def read_file1_data(path):
    location = pd.read_csv(path, delimiter='\t', header=None, names=['location'])
    location = location['location'].str.split(expand=True)
    location.columns = ['Timestamp', 'Ignore2', 'Ignore3', 'accuracy(m)',
                        'latitude', 'longitude', 'altitude']
    location.drop(['Ignore2', 'Ignore3', 'accuracy(m)', 'altitude'], axis=1, inplace=True)
    location['Timestamp'] = location['Timestamp'].astype('int64')
    return location
#2
def read_file2(path):
    labels = pd.read_csv(path, skiprows=0, header=None,
                         infer_datetime_format=True, delim_whitespace=True)
    # for clarity rename columns
    labels.columns = ['start_time', 'end_time', 'label']
    return labels
#3
def apply_labels(location, labels):
    # NOTE: searchsorted assumes labels is sorted by start_time
    indices = labels['start_time'].searchsorted(location['Timestamp'], side='right') - 1
    no_label = (indices < 0) | (location['Timestamp'].values >= labels['end_time'].iloc[indices].values)
    location['label'] = labels['label'].iloc[indices].values
    # use .loc instead of chained indexing (location['label'][no_label] = ...),
    # which may silently write to a copy and leave the column unchanged
    location.loc[no_label, 'label'] = np.nan
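For reference, the searchsorted logic of apply_labels behaves like this on toy numbers (assuming labels is sorted by start_time, and using .loc to avoid the chained-assignment pitfall):

```python
import numpy as np
import pandas as pd

# toy data: labels must be sorted by start_time for searchsorted to work
labels = pd.DataFrame({'start_time': [0, 20],
                       'end_time':   [10, 30],
                       'label':      [7, 4]})
location = pd.DataFrame({'Timestamp': [5, 15, 25, 40]})

# index of the last interval whose start_time <= Timestamp
indices = labels['start_time'].searchsorted(location['Timestamp'], side='right') - 1
# flag rows before the first interval or at/past their interval's end_time
no_label = (indices < 0) | (location['Timestamp'].values >= labels['end_time'].iloc[indices].values)
location['label'] = labels['label'].iloc[indices].values
location.loc[no_label, 'label'] = np.nan
# location['label'] is now [7.0, NaN, 4.0, NaN]
```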
#4
def read_user(user_folder):
    labels = None
    location_files = glob.glob(os.path.join(user_folder, 'Hips_Location.txt'))
    df = pd.concat([read_file1_data(f) for f in location_files])
    labels_file = os.path.join(user_folder, 'labels_track_main.txt')
    if os.path.exists(labels_file):
        labels = read_file2(labels_file)
        apply_labels(df, labels)
    else:
        df['label'] = np.nan
    return df
def read_all_users(folder):
    subfolders = os.listdir(folder)
    dfs = []
    for i, sf in enumerate(subfolders):
        df = read_user(os.path.join(folder, sf))
        dfs.append(df)
    return pd.concat(dfs)

# final dataframe
DF = read_all_users('/content/drive/MyDrive/Sussex Trajectories/Data')
However, it returns just NaN in the newly created column:
Timestamp latitude longitude label
0 1496127155569 52.026602082620364 0.964491661294289 NaN
1 1496127157333 52.026602287 0.964491665 NaN
2 1496127158334 52.026603335 0.964496445 NaN
3 1496127159336 52.026602658 0.964503625 NaN
4 1496127160340 52.026600765 0.964518915 NaN
... ... ... ... ...
33156 1496349577578 50.846888356 -0.133469528 NaN
33157 1496349578581 50.84689128 -0.133483904 NaN
33158 1496349579583 50.84689199 -0.133497738 NaN
33159 1496349580587 50.846893418 -0.133511131 NaN
33160 1496349581590 50.846894132 -0.133517828 NaN
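Given the all-NaN column, one thing worth checking is whether the timestamps in a given file1 and the ranges in its file2 overlap at all; if every Timestamp lies outside [min(start_time), max(end_time)], every lookup misses and NaN is the expected result. A minimal check, where the helper name ranges_overlap is my own (the numbers below are taken from the samples in this post, where the ranges happen to be disjoint):

```python
import pandas as pd

def ranges_overlap(location, labels):
    """True if at least one file1 timestamp can possibly fall inside a file2 range."""
    return (location['Timestamp'].max() >= labels['start_time'].min() and
            location['Timestamp'].min() <= labels['end_time'].max())

# timestamps from the printed output (1496...) vs a range from the file2 sample (1488...)
location = pd.DataFrame({'Timestamp': [1496127155569, 1496349581590]})
labels = pd.DataFrame({'start_time': [1488377142416],
                       'end_time':   [1488378192537],
                       'label':      [7]})
print(ranges_overlap(location, labels))  # -> False: every lookup would yield NaN
```

If this prints False for a real subfolder's pair of files, the NaNs come from the data (e.g. different time units or clocks in the two files), not from the matching code.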
Any suggestion would be greatly appreciated.