
I have a small program that searches through many large files (500,000+ rows per file) and exports the result to a CSV file. I would like to know if it is possible to stop searching after finding a specific date in the files. For example, after finding the ini_date value (column 2, e.g. 02/12/2020), the program should stop searching and export the result, including the rows that contain "02/12/2020" in column 2 and also match the other search criteria.

Currently I have 73 datalog*.log files in the folder, and this number keeps growing. datalog0.log is the oldest file and datalog72.log is the newest; at some point there will be a datalog73.log (I would like to start searching in the latest file). Is this possible with Python alone? If not, I will have to make use of SQL as well.

Here you can see my code:

import pandas as pd
from glob import glob

files = glob('C:/ProgramA/datalog*.log')
df = pd.concat([pd.read_csv(f,
                            low_memory=False,
                            sep=',',
                            names=["0", "1", "2", "3", "4", "5", "6", "7"]) for f in files])


#Column 0: IP
#Column 1: User
#Column 2: Date
#Column 3: Hour

ip = input('Optional - Set IP: ')  #column 0
user = input('Optional - Set User: ')     #column 1
ini_date = input('Mandatory - From Day (Format MM/DD/YYYY): ')
fin_date = input('Mandatory - To Day (Format MM/DD/YYYY): ')
ini_hour = input('Mandatory - From Hour (Format 00:00:00): ')
fin_hour = input('Mandatory - To Hour (Format 00:00:00): ')

if ip == '' and user == '':
    df1 = df[(df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
elif ip == '':
    df1 = df[(df["1"] == user) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
elif user == '':
    df1 = df[(df["0"] == ip) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
else:
    df1 = df[(df["0"] == ip) & (df["1"] == user) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]

df1.to_csv('C:/ProgramA/result.csv', index=False)

Thanks.


Logs are sequential and look like the following example:

File0:
        1.1.1.1      user       09/24/2020       09:18:00    Other data...................
        1.1.1.1      user       09/24/2020       10:00:00    Other data...................
        1.1.1.1      user       09/25/2020       07:30:00    Other data...................
        1.1.1.1      user       09/25/2020       09:30:00    Other data...................

File1:
        1.1.1.1      user       09/26/2020       04:18:00    Other data...................
        1.1.1.1      user       09/26/2020       10:00:00    Other data...................
        1.1.1.1      user       09/26/2020       11:18:00    Other data...................
        1.1.1.1      user       09/26/2020       12:00:00    Other data...................

File2:
        1.1.1.1      user       09/26/2020       14:18:00    Other data...................
        1.1.1.1      user       09/27/2020       16:00:00    Other data...................
        1.1.1.1      user       09/28/2020       10:18:00    Other data...................
        1.1.1.1      user       09/29/2020       12:00:00    Other data...................

So, if I am filtering by ini_date >= "09/27/2020" and fin_date <= "09/27/2020", I would like the program to stop searching and export only these rows from File2 (otherwise, the program would unnecessarily check the other two files, taking more time):

        1.1.1.1      user       09/27/2020       16:00:00    Other data...................
        1.1.1.1      user       09/28/2020       10:18:00    Other data...................
Debon54
  • It would help to answer if you could also post a sample of what your log entries look like. Since your logs seem to be sequential, I think this should be possible with Python. – sai Sep 27 '20 at 09:06
  • Yes, logs are sequential. Main post edited to add an example. Thanks. – Debon54 Sep 27 '20 at 12:02

2 Answers

import glob
import os
import pandas as pd

list_of_files = glob.glob('/path/to/folder/*')

# Sort files by creation time, newest first
sorted_file_names = sorted(list_of_files, key=os.path.getctime, reverse=True)

rows_found = False
final_df = pd.DataFrame()
for file in sorted_file_names:
    df = pd.read_csv(file)

    # {Perform required operations}

    # Fetch the rows that fall inside the date range
    df1 = df.loc[(df['2'] >= ini_date) & (df['2'] <= fin_date)]

    if not df1.empty:
        # Matching rows found in this file: collect them
        rows_found = True
        final_df = final_df.append(df1, ignore_index=False)
    elif rows_found:
        # Rows matched in a previously processed (newer) file but not in this one,
        # so the date range has been passed: stop searching
        break

final_df.to_csv("Name.csv")
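
A possible refinement (not part of the original answer): since the files are named datalog0.log … datalog72.log, sorting by the numeric suffix instead of the creation time gives the same newest-first order even if the files are later copied or restored with different timestamps. A minimal sketch, assuming that naming pattern:

import re

# Sort datalogN.log files by the number in their name, newest (highest N) first.
# Assumes the numeric suffix reflects the file order, as described in the question.
def datalog_index(path):
    match = re.search(r'datalog(\d+)\.log$', path)
    return int(match.group(1)) if match else -1

sorted_file_names = sorted(list_of_files, key=datalog_index, reverse=True)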
Shradha
  • Thanks for the answer. The problem with this is that I have a lot of rows which match the ini_date value, and I also need to get all the rows which match it, not just the first one. – Debon54 Sep 27 '20 at 12:08
  • `df.loc[df['2'] == ini_date]` will fetch all the rows from the dataframe which match `ini_date`. – Shradha Sep 27 '20 at 12:33
  • Ok, thanks for the answer. I have tested it and it works, but how can I ensure that the program starts searching in the last file of the folder? I mean, start searching in datalog72.log (the last created file) instead of datalog0.log? Thanks – Debon54 Sep 27 '20 at 16:58
  • Please see the update above. I've not tested it, let me know if it doesn't work. – Shradha Sep 27 '20 at 18:35
  • Thank you very much, I have used this code with some modifications and it has worked great for what I needed :-). – Debon54 Sep 28 '20 at 09:07
  • I have been testing this code today and I have realized that the program is pretty slow when there are tons of files in the folder. I am wondering if, instead of looping over each row to check whether it is inside the date range with this line: df2 = df1.iloc[(df1['2'] >= ini_date) & (df1['2'] <= fin_date)], I could tell the program to only check the first row. I have tried to do it with this but I am getting an error: df2 = df1[(df1['2'] >= ini_date) & (df1['2'] <= fin_date)].iloc[1]. Any idea to speed up the program? Thanks! – Debon54 Sep 28 '20 at 18:10

The answer given by @Shradha should fetch all the entries for the date range you are searching for; once you have these entries, you can apply the other filters to that subset alone to save computation and time, as sketched below.
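
For example (a minimal sketch reusing the final_df built in the first answer and the ip, user, ini_hour and fin_hour inputs from the question; these names are assumptions, not part of the original answer):

# Apply the optional filters only to the rows already restricted to the date range
result = final_df
if ip:
    result = result[result['0'] == ip]
if user:
    result = result[result['1'] == user]
result = result[(result['3'] >= ini_hour) & (result['3'] <= fin_hour)]
result.to_csv('C:/ProgramA/result.csv', index=False)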

I initially thought that setting the dates as the index of the dataframe would reduce the time needed to find the log entries, but I was wrong: a boolean mask works faster than indexing.

import pandas as pd
import datetime
import numpy as np
import time

if __name__ == '__main__':
    # Look up one date through a DatetimeIndex built from the log sample
    df = pd.read_csv('~/Documents/tmp.csv', names=["version", "user", "date", "time", "data1", "data2"])
    df.set_index('date', inplace=True)
    df.index = pd.to_datetime(df.index, dayfirst=True)
    print(df.loc[datetime.date(2020, 9, 27)])
    print('############################')

    # Build a larger test frame: 100,000 dates, appended 5 times -> 500,000 rows
    date_index = pd.date_range(start='1/1/1850', periods=100000)
    some_data = pd.Series(np.random.randint(1, 100, size=date_index.shape))
    df = pd.DataFrame(data={'some_data': some_data})
    df.index = date_index
    df = df.append([df, df, df, df])
    print('shape of df is: ', df.shape)

    # Time the lookup through the DatetimeIndex
    start = time.time()
    print(df.loc[datetime.date(2020, 3, 14)])
    end = time.time()
    print('time taken is: ', end - start)
    print('############################')

    # Time the same lookup through a boolean mask on a plain column
    df.reset_index(inplace=True)
    df.columns = ['my_index', 'some_data']
    start = time.time()
    print(df.loc[df['my_index'] == datetime.date(2020, 3, 14)])
    end = time.time()
    print('time taken is: ', end - start)
    print('############################')

Output

version     1.1.1.1
user           user
time       16:00:00
data1         Other
data2          data
Name: 2020-09-27 00:00:00, dtype: object
############################
shape of df is:  (500000, 1)
Through direct indexing
            some_data
2020-03-14         93
2020-03-14         93
2020-03-14         93
2020-03-14         93
2020-03-14         93
time taken is:  0.0407321453094
############################
Using boolean mask
         my_index  some_data
62164  2020-03-14         93
162164 2020-03-14         93
262164 2020-03-14         93
362164 2020-03-14         93
462164 2020-03-14         93
time taken is:  0.00653505325317
############################
sai
  • Thanks for the answer. Yes, my main goal is to decrease computing time as much as possible. The thing is that I need a way to tell my program to start searching in the last created file, and to start from that file's last row. Why? Because if the same date appears in two files, the break will stop the loop and I will miss data from the previous file. – Debon54 Sep 27 '20 at 17:58
  • @Debon54 you could pre sort the files based on their time of creation by following the answer from [here](https://stackoverflow.com/questions/168409/how-do-you-get-a-directory-listing-sorted-by-creation-date-in-python) and then process them as you wish. – sai Sep 27 '20 at 18:19
  • Thank you very much. And do you know how to start iterating from the last row in a csv file? Thanks! – Debon54 Sep 27 '20 at 18:23
  • You are welcome, and to iterate in reverse you could use negative indices (see the sketch below). – sai Sep 27 '20 at 18:31
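
For illustration, a hedged sketch (using the column layout and the ini_date variable from the question; not tested against the real logs): reversing the row order with iloc[::-1] lets each file be walked from its newest row backwards, stopping once the dates become older than the range of interest.

import pandas as pd

# Read one log file with the same layout as in the question
df = pd.read_csv('C:/ProgramA/datalog72.log', sep=',',
                 names=["0", "1", "2", "3", "4", "5", "6", "7"])

# Walk the rows newest-first and stop once the date column falls before ini_date
for row in df.iloc[::-1].itertuples(index=False):
    if row[2] < ini_date:
        break
    # ...otherwise, process the row (e.g. collect it if it matches the other filters)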