I have a small program that searches through many big files (500,000+ rows per file) and exports the result to a CSV file. I would like to know if it is possible to stop searching after finding a specific date in the files. For example, after finding the ini_date value in column 2 (for example 02/12/2020), the program should stop searching and export the result, including rows that contain "02/12/2020" in column 2 and also match the other search criteria.
Currently I have 73 datalog.log files in the folder, and this number keeps growing. datalog0.log is the oldest file and datalog72.log is the newest; in some time it will be datalog73.log (I would like to start searching in the latest file). Is it possible to do this with just Python? If not, I will have to also make use of SQL for this.
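One detail I noticed while trying this: glob returns names in lexicographic order, so datalog10.log sorts before datalog2.log. This is a small helper I sketched (not part of my current code) to order the files by their numeric suffix, newest first:

```python
import re

def sort_newest_first(files):
    """Order datalog*.log paths by their numeric suffix, highest number
    first, since plain glob() order is lexicographic (datalog10 before
    datalog2)."""
    return sorted(files,
                  key=lambda f: int(re.search(r'datalog(\d+)\.log$', f).group(1)),
                  reverse=True)
```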
Here you can see my code:
import pandas as pd
from glob import glob
files = glob('C:/ProgramA/datalog*.log')
df = pd.concat([pd.read_csv(f,
                            low_memory=False,
                            sep=',',
                            names=["0","1","2","3","4","5","6","7"]) for f in files])
#Column 0: IP
#Column 1: User
#Column 2: Date
#Column 3: Hour
ip = input('Optional - Set IP: ') #column 0
user = input('Optional - Set User: ') #column 1
ini_date = input('Mandatory - From Day (Format MM/DD/YYYY): ')
fin_date = input('Mandatory - To Day (Format MM/DD/YYYY): ')
ini_hour = input('Mandatory - From Hour (Format 00:00:00): ')
fin_hour = input('Mandatory - To Hour (Format 00:00:00): ')
if ip == '' and user == '':
    df1 = df[(df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
elif ip == '':
    df1 = df[(df["1"] == user) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
elif user == '':
    df1 = df[(df["0"] == ip) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
else:
    df1 = df[(df["0"] == ip) & (df["1"] == user) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
df1.to_csv('C:/ProgramA/result.csv', index=False)
Thanks.
Logs are sequential and look like the following example:
File0:
1.1.1.1 user 09/24/2020 09:18:00 Other data...................
1.1.1.1 user 09/24/2020 10:00:00 Other data...................
1.1.1.1 user 09/25/2020 07:30:00 Other data...................
1.1.1.1 user 09/25/2020 09:30:00 Other data...................
File1:
1.1.1.1 user 09/26/2020 04:18:00 Other data...................
1.1.1.1 user 09/26/2020 10:00:00 Other data...................
1.1.1.1 user 09/26/2020 11:18:00 Other data...................
1.1.1.1 user 09/26/2020 12:00:00 Other data...................
File2:
1.1.1.1 user 09/26/2020 14:18:00 Other data...................
1.1.1.1 user 09/27/2020 16:00:00 Other data...................
1.1.1.1 user 09/28/2020 10:18:00 Other data...................
1.1.1.1 user 09/29/2020 12:00:00 Other data...................
So, if I am filtering by ini_date >= "09/27/2020" and fin_date <= "09/27/2020", I would like the program to stop searching and export only the following rows from File2 (otherwise, the program would unnecessarily check the other two files, taking more time):
1.1.1.1 user 09/27/2020 16:00:00 Other data...................
1.1.1.1 user 09/28/2020 10:18:00 Other data...................
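The early-stop idea I have in mind would look something like this sketch (it assumes the file list is already sorted newest first and that rows inside each file are chronological; note that comparing MM/DD/YYYY strings, as my current filter also does, only behaves correctly while the dates being compared share the same year):

```python
import pandas as pd

COLS = ["0", "1", "2", "3", "4", "5", "6", "7"]

def search_until(files_newest_first, ini_date, fin_date):
    """Read log files from newest to oldest, filter by date, and stop
    as soon as a file already reaches back before ini_date."""
    frames = []
    for f in files_newest_first:
        df = pd.read_csv(f, sep=',', low_memory=False, names=COLS)
        frames.append(df[(df["2"] >= ini_date) & (df["2"] <= fin_date)])
        # Files are sequential: if this file contains rows older than
        # ini_date, every earlier file is entirely older too, so there
        # is no need to open them.
        if df["2"].min() < ini_date:
            break
    return pd.concat(frames) if frames else pd.DataFrame(columns=COLS)
```

With the File0/File1/File2 example above, only File2 would be opened, because its oldest row (09/26/2020) is already before ini_date.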