I have a .csv file with data from different runs. So, my file looks like:
execution on Monday
name1, name2, name3
1,2,3
4,5,6
end Monday
execution on Tuesday
name1, name2, name3
7,8,9
10,11,12
end Tuesday
The number of rows per run is not known in advance.
I am looking for an efficient way to read only the rows of a specific run into a pandas DataFrame. Right now I read the file, compare each line with the search string, and return the numbers of the start and end lines. Then I call pandas.read_csv with skiprows to start from the found start line and afterwards delete everything below the end line.
Edit: The output I wish for is a DataFrame that looks like:
name1  name2  name3
7      8      9
10     11     12
My current code is:
import pandas as pd

filename = "test.csv"
durchgang = "Tuesday"

def findstart(filename, durchgang):
    # return the matching line and its 1-based number for "execution on <durchgang>"
    with open(filename, "r") as file:
        start = 0
        for line in file:
            start += 1
            if line[len("execution on "):len("execution on ") + len(durchgang)] == durchgang:
                return line, start

def findend(filename, durchgang):
    # return the matching line and its 1-based number for "end <durchgang>"
    with open(filename, "r") as file:
        end = 0
        for line in file:
            end += 1
            if line[0:len("end ") + len(durchgang)] == "end " + durchgang:
                return line, end

start = findstart(filename, durchgang)[1]
end = findend(filename, durchgang)[1]

# skip everything up to and including the start marker; the next line becomes the header
df = pd.read_csv(filename, header=0, skiprows=start)
# drop the end marker and everything below it (it ends up at DataFrame index end - start - 2)
df = df.drop(index=range(end - start - 2, len(df.index)))
Both deleting the unused rows and searching through the file line by line beforehand seem very inelegant. Is there a better solution that does not go through the file twice? I saw people using a query condition to filter, but that needs an index.
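For illustration, here is a rough sketch of the kind of single-pass solution I have in mind: walk the file once, collect only the lines between the "execution on ..." and "end ..." markers, and hand them to pandas through io.StringIO. The helper name read_run is just made up for this sketch, and it assumes the marker lines always look exactly like in the example above.

import io
import pandas as pd

def read_run(filename, durchgang):
    # single pass: collect the lines between the two marker lines of one run
    lines = []
    inside = False
    with open(filename, "r") as file:
        for line in file:
            if line.startswith("execution on " + durchgang):
                inside = True   # the following lines belong to the requested run
                continue
            if line.startswith("end " + durchgang):
                break           # the run is finished, stop reading
            if inside:
                lines.append(line)
    # the first collected line is the header row, the rest are data rows;
    # skipinitialspace handles the space after the commas in the header
    return pd.read_csv(io.StringIO("".join(lines)), skipinitialspace=True)

df = read_run("test.csv", "Tuesday")
print(df)

This reads the file only once and never parses the other runs, but it first builds the block as a string in memory, so I am not sure it is the cleanest way.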