0

I have found a separate solutions for parts of what I want to do but nothing that has worked together.

  1. I am searching a set of files (call them set1) and creating a pandas data frame (df) with a date (as yyyy.doy), a dump number(2 digit number), and a start time(VC5 Start) of the data inside of the file(in gps seconds). the date and dump numbers are actually designated in the file name. the dataframe also has a blank column for the data in the next step.I have no issues here with the following code.
df=pd.DataFrame(columns=['Date','Dump Number','VC5 Start', 'VC2 Start'])

for files in VC5filelist:
    #print(files)    
    filedate_df=files[18:26]   #save date of file to a variable
    filedump_df=files[31:33]  #save dump number of file to a variable

    ds=netCDF4.Dataset(files, 'r')     #read each netcdf file into a data set
    VC5gpsstart=ds.variables['TIME'][0]   # save gps first timestamp of VC5 file into a variable
    #append file data to main dataframe
    df = df.append({'Date' : filedate_df, 'Dump' : filedump_df, 'VC5 Start' : VC5gpsstart},ignore_index = True)
  1. I am searching a second set of files (call them set2) for the same info however the start time will be VC2time.
for vc2files in VC2filelist:
    print(vc2files)
    vc2filedate_df=vc2files[18:26]   #save date of file to a variable
    vc2filedump_df=vc2files[31:33]  #save dump number of file to a variable
    print(vc2filedate_df+':'+vc2filedump_df)

    dsvc2=netCDF4.Dataset(vc2files, 'r')     #read each netcdf file into a data set
    VC2gpstart=dsvc2.variables['time'][0]   # save gps first timestamp of VC5 file into a variable
    VC2df = VC2df.append({'Date' : vc2filedate_df, 'Dump' : vc2filedump_df, 'VC2 Start' : VC2gpsstart},ignore_index = True)

I want to append/insert the VC2time data into the last column(VC2 Start) and use the date and dump numbers of the second set of files to designate where in the dataframe the starttime should go. example

Date       Dump       vc5start       vc2start
2022.001   05         121651215      ***456447156***

the bold and italic data is the only thing i cannot produce right now. I have been trying a find the correct row to insert my data with

row=df.index.get_loc(df.query('Date' == vc2filedate_df) and ('Dump'==vc2filedump_df).index[0])

to no avail. my next step was to be

df.loc[row:'VC2 Start']=VC2gpsstart

what I want to know is

A: given the date and dump number of the file from my set2 , how do I find the row of the dataframe with the same date and dump number?

B: how do I then add the VC2 start data into the VC2 start column of the data frame on the row found in question A?

@Larrybird

VC5df                         VC2df
Date     Dump   VC5time       Date    Dump  VC2time
2022.001   01    125         2022.001  01     125
2022.001   02    128         2022.001  02     130
2022.001   05    260         2022.001  05     261
2022.002   01    035         2022.002  01     035

@LarryBird, I after researching merge I found the (a) solution

creating datframes

VC5df=pd.DataFrame(columns=['Date','Dump','VC5 Start'])
VC2df=pd.DataFrame(columns=['Date','Dump','VC2 Start'])

appending data to them within loops (as above), then using

merged_df=pd.merge(VC5df,VC2df,on=["Date","Dump"])

creates the following (looking at first and second day of 2022)

         Date Dump     VC5 Start     VC2 Start
0    2022.001   01  1325029429.0  1325029440.0
1    2022.001   02  1325030705.0  1325030760.0
2    2022.001   03  1325034031.0  1325034060.0
3    2022.001   04  1325035511.0  1325035560.0
4    2022.001   05  1325036791.0  1325036879.0
..        ...  ...           ...           ...
103  2022.002   48  1325188946.0  1325188980.0
104  2022.002   49  1325191628.0  1325191680.0
105  2022.002   50  1325192627.0  1325192640.0
106  2022.002   51  1325195052.0  1325195100.0
107  2022.002   52  1325198890.0  1325198940.0
  • It sounds like you might be better off doing a `.join()` / `.merge()` instead of trying to explicitly find the row index yourself. For example if both dataframes are indexed by `Date` and `dump`, you could do `df1.merge(df2, on=['Date', 'dump'])` (or something to that effect). I can provide an example of how this works if you will add some dummy data to your question reproducing your problem. – LarryBird Sep 13 '22 at 21:55
  • @LarryBird: I do not know how to add sample data (sorry) but added what the two data frames would look like if I where to create them separately. – mperrotta913 Sep 14 '22 at 13:37
  • to provide sample data you can copy and paste a small section of your data into your question then format it as code (highlight the block and press ctrl+k). Alternatively, you can provide code that will reproduce the data someone else can easily copy and run (eg `df = pd.DataFrame(np.arange(9).reshape(3,3))` ). You should have a read [how to ask](https://stackoverflow.com/help/how-to-ask) to understand more about asking good questions. By the way, did merge solve your problem? – LarryBird Sep 14 '22 at 21:06
  • @LarryBird, yes. the merge worked perfectly. I am not sure how to mark a comment as the answer though. – mperrotta913 Sep 15 '22 at 19:00
  • I don't think you can - I have added as answer with some extra information. Glad it helped. – LarryBird Sep 15 '22 at 21:57

1 Answers1

0

It sounds like you might be better off doing a .join() / .merge() instead of trying to explicitly find the row index yourself. For example if both dataframes are indexed by Date and dump, you could do df1.merge(df2, on=['Date', 'dump']) (or something to that effect).

If you are interested, there is an excellent summary of join() and merge() on this answer. Basically if both dataframes have matching index, and you wish to join on the index, you can use df1.join(df2) to save typing. merge() is more flexible in that you can specify various combinations of index or columns to do the join on.

Also worth knowing is pd.concat, (see docs here) which is another useful function when you are combining data. In particular, it can be more efficient (and more readable) if you need to join many dataframes, since you can call it on a list of dataframes in one line instead of having to loop through and join multiple times.

Hope this helps.

LarryBird
  • 333
  • 1
  • 7