Python: How to read and compare text files in 80 separate folders and merge them into one single pandas dataframe based on a condition?

Question

I have a folder(user) which contains 80 subfolders (1,2, 3,…, 80) and in each subfolder there are 2 text files (file1 and file2). file1 has 7 columns and file2 has 3 columns both without labels and are not in the same size. First column of file1 is time and the first and second columns of file2 are start_time and end_time. So, what I want to do is to:

read file1 and file2 in each subfolder and compare whether the time in the first column of the file1 is in the range of (start_time,end_time) in the file2 or not.
if yes, put the value in the third column of file2 in a new column in file1 corresponding to that time which is in the range and convert it to a pandas dataframe.
if No, I mean if the time is not in the range (start_time,end_time),put NAN in the new created column.
do this for all 2 text files in all folders(1,2,3,...80) so that at the end we will have 80 dataframes
convert all the dataframes to one single pandas dataframe.

file1 is something like this:


    1493974279325 1251148166327 417683620715 250.0 50.847060192 -0.134362797 66.307039766
    1493974280326 1252150237681 417683620715 350.0 50.847057006 -0.134359581 105.778622992
    1493984243830 1253153644973 417683620715 350.0 50.847054933 -0.134318363 158.247842792
    1493984243840 1254156207993 417683620715 350.0 50.847051726 -0.134282482 158.247842792
    1493974283335 1255160442889 417683620715 350.0 50.847050123 -0.134264542 158.247842792
    1493974284338 1256162859035 417683620715 350.0 50.847049321 -0.134255572 158.247842792
    1493974285340 1257165017889 417683620715 350.0 50.847048921 -0.134251086 158.247842792
    1493974286343 1258168318930 417683620715 350.0 50.84704872 -0.134248844 158.247842792
    1493974287347 1259171307992 417683620715 350.0 50.84704862 -0.134247723 158.247842792
    1493974288351 1260175022576 417683620715 350.0 50.84704857 -0.134247162 158.247842792
    1493974289352 1261177816325 417683620715 350.0 50.847048545 -0.134246882 158.247842792
    1493984243890 1262179719971 417683620715 350.0 50.847048532 -0.134246741 158.247842792
    1493984243900 1263182887158 417683620715 350.0 50.847048526 -0.134246671 158.247842792

file2 is something like this:


1488377142416 1488378192537 7
1488379212697 1488380967936 3
1488381613217 1488382948229 4
1488383667626 1488384747965 4
1488385047398 1488385633069 5
1488386203182 1488386877333 5
1488386952452 1488388272444 4
1488388482553 1488389532601 3
1488389863114 1488391843248 4

expected output would be something like this( what I want is a pandas dataframe),in this output file the last column is the new created column and the values come from the third column in file2. (this is just an example to show the expected output and times in the first column are not in the range of start time and end time)


      1493974279325 1251148166327 417683620715 250.0 50.847060192 -0.134362797 66.307039766   7
        1493974280326 1252150237681 417683620715 350.0 50.847057006 -0.134359581 105.778622992  7
        1493984243830 1253153644973 417683620715 350.0 50.847054933 -0.134318363 158.247842792  7
        1493984243840 1254156207993 417683620715 350.0 50.847051726 -0.134282482 158.247842792  4
        1493974283335 1255160442889 417683620715 350.0 50.847050123 -0.134264542 158.247842792  4
        1493974284338 1256162859035 417683620715 350.0 50.847049321 -0.134255572 158.247842792  4
        1493974285340 1257165017889 417683620715 350.0 50.847048921 -0.134251086 158.247842792  4
        1493974286343 1258168318930 417683620715 350.0 50.84704872 -0.134248844 158.247842792  4
        1493974287347 1259171307992 417683620715 350.0 50.84704862 -0.134247723 158.247842792  4
        1493974288351 1260175022576 417683620715 350.0 50.84704857 -0.134247162 158.247842792  3
        1493974289352 1261177816325 417683620715 350.0 50.847048545 -0.134246882 158.247842792  3
        1493984243890 1262179719971 417683620715 350.0 50.847048532 -0.134246741 158.247842792 NAN
        1493984243900 1263182887158 417683620715 350.0 50.847048526 -0.134246671 158.247842792 NAN

what I have done:


       #1
    def read_file1_data(path):
        location=pd.read_csv(path,delimiter='\t',header=None,names=['location'])
        location=location['location'].str.split(expand=True)
        location.columns=['Timestamp','Ignore2','Ignore3','accuracy(m)','latitude','longitude','altitude']
        location.drop(['Ignore2','Ignore3','accuracy(m)','altitude'],axis=1,inplace=True)
        
        location['Timestamp'] = location['Timestamp'].astype('int64')
    
        return location
    #2
    def read_file2(path):
        labels = pd.read_csv(path, skiprows=0, header=None,
                            infer_datetime_format=True, delim_whitespace=True)
    
        # for clarity rename columns
        labels.columns = ['start_time', 'end_time', 'label']
        
        return labels
    #3
    def apply_labels(location, labels):
        indices = labels['start_time'].searchsorted(location['Timestamp'], side='right') - 1
        no_label = (indices < 0) | (location['Timestamp'].values >= labels['end_time'].iloc[indices].values)
        location['label'] = labels['label'].iloc[indices].values
        location['label'][no_label] = np.NaN
    
    #4
    def read_user(user_folder):
        labels = None
    
        location_files = glob.glob(os.path.join(user_folder,'Hips_Location.txt'))
        df = pd.concat([read_file1_data(f) for f in location_files])
    
        labels_file = os.path.join(user_folder, 'labels_track_main.txt')
        if os.path.exists(labels_file):
            labels = read_file2(labels_file)
            apply_labels(df, labels)
        else:
            df['label'] = np.NAN
    
        return df
    
    def read_all_users(folder):
        subfolders = os.listdir(folder)
        dfs = []
        for i, sf in enumerate(subfolders):
            df = read_user(os.path.join(folder,sf))
            dfs.append(df)
        return pd.concat(dfs)
    
    #final dataframe
    DF = read_all_users('/content/drive/MyDrive/Sussex Trajectories/Data')

However, it returns jus "NAN" in the new created column:


    Timestamp   latitude    longitude   label
    0   1496127155569   52.026602082620364  0.964491661294289   NaN
    1   1496127157333   52.026602287    0.964491665 NaN
    2   1496127158334   52.026603335    0.964496445 NaN
    3   1496127159336   52.026602658    0.964503625 NaN
    4   1496127160340   52.026600765    0.964518915 NaN
    ... ... ... ... ...
    33156   1496349577578   50.846888356    -0.133469528    NaN
    33157   1496349578581   50.84689128 -0.133483904    NaN
    33158   1496349579583   50.84689199 -0.133497738    NaN
    33159   1496349580587   50.846893418    -0.133511131    NaN
    33160   1496349581590   50.846894132    -0.133517828    NaN

Any suggestion would be greatly appreciated.

score 1 · Answer 1 · answered Nov 21 '22 at 12:33

1

Duplicate of this post. Short answer : give the same name, say ts, to the timestamp column in both dataframes, then use merge method :

pd.merge(df1, df2, on='ts', how='outer')

See also the documentation of merge method.

answered Nov 21 '22 at 12:33

ClaireD

26
4

thank you for your answer. I do not think my question is duplicate because I want to know how to apply this merge() function to all text files (file1&file2) which are in different subfolders and with your suggestion, I can only merge two text files in 1 subfolder by specifying their file path – Siavash Nov 21 '22 at 13:23

Python: How to read and compare text files in 80 separate folders and merge them into one single pandas dataframe based on a condition?

1 Answers1