0

I have a .csv file with data from different runs. So, my file looks like:

execution on Monday
name1, name2, name3
1,2,3
4,5,6
end Monday

execution on Tuesday
name1, name2, name3
7,8,9
10,11,12
end Tuesday

Thereby, it is unknown how many rows there are for one run.

I am looking for an efficient way to read only the rows from a specific run into a pandas DataFrame. Right now, I read the file, compare each line with the searched string and return the number of the start and end line. Then I do pandas.read_csv with skiprow to start from the found start line and afterwards delete everything below the end line.

Edit: The ouput I wish for would be a DataFrame which looks like:

name1 name2 name3
7  8  9
10 11 12

My current code is:

import pandas as pd

filename="test.csv"
durchgang="Tuesday"

def findstart(filename,durchgang):
    file=open(filename,"r")
    start=0
    for line in file:
        start+=1
        if line[len("execution on "):len("execution on ")+len(durchgang)] == durchgang:
            return line, start
        else:
            pass
    file.close()

def findend(filename,durchgang):
    file=open(filename,"r")
    end=0
    for line in file:
        end+=1
        elif line[0:len("end ")+len(durchgang)] == "end "+durchgang:
            return line, end

start=findstart(filename,durchgang)[1]
end=findend(filename, durchgang)[1]

df=pd.read_csv(filename, header=1, skiprows=start)
df.drop(index=[end:len(df.index)])

Both, deleting the not used rows and first searching through the file with readline seems very unelegant. Is there a better solution that does not go through the file twice. I saw people using the query condition to filter, but this needs an index.

  • Please include the code you have tried. And your expected output. – not_speshal Mar 08 '22 at 14:20
  • you have non-standard file so don't expect special functions for this file. But maybe it would be simpler to read all as single string (`read()`)and split on empty lines `split("\n\n")` to get every part as separated string. And later use `io.StringIO` to read directly from selected string `read_csv( io.StringIO(text), ...)` – furas Mar 08 '22 at 14:33
  • or maybe you should read line by line and check which line is empty - and when you find starting empty line then read lines to next empty line and keep all these lines in list or single string - and later use `io.StringIO` to load from this string. – furas Mar 08 '22 at 14:40
  • how do you recognize run which you want to load? By `Monday`,`Tuesday` or something different? Describe it in question (not in comment) so all people will see it. – furas Mar 08 '22 at 15:04

2 Answers2

1

It seems you have empty line betwin runs - so you could use this fact.

And you can use fact that pandas can read from file-like-object

You can read all as single string and split on empty lines

runs = text.split("\n\n")

and use selected part with module io to create file-like-object in memory

file_like_object = io.StringIO(selected_text)

and read from this file

pd.read_csv( file_like_object, ...)

Minimal working code

I use example text directly in code (so everyone can copy and test it) but should use open() and text = read()

text = '''execution on Monday
name1, name2, name3
1,2,3
4,5,6
end Monday

execution on Tuesday
name1, name2, name3
7,8,9
10,11,12
end Tuesday

execution on Tuesday
name1, name2, name3
107,108,109
110,111,112
end Tuesday'''

import io
import pandas as pd

# read all as single string
#with open("input.csv") as f:
#    text = f.read()

# split on empty lines
runs = text.split('\n\n')

# get selected `run` 
selected = runs[1]

# create file-like-object (file in memory)
file_object = io.StringIO(selected)

# read from file-like-object
df = pd.read_csv(file_object , skiprows=1, skipfooter=1)

print(df)

EDIT:

If file is too big to load it and split in memory then you still can use your method to search part but you should keep this part in memory and use io.StringIO to load to DataFrame without reading file again.


Minimal working example which read line by line until find execution on Tuesday and next read line by line until find end Tuesday but also keep lines in string - and later use this string with io.StringIO to load to DataFrame`

text = '''execution on Monday
name1, name2, name3
1,2,3
4,5,6
end Monday

execution on Tuesday
name1, name2, name3
7,8,9
10,11,12
end Tuesday

execution on Friday
name1, name2, name3
107,108,109
110,111,112
end Tuesday'''

import io
import pandas as pd

# place for selected part
selected = ""

name = 'Tuesday'
start_text = f'execution on {name}'
end_text = f'end {name}'

# open file only once
#f = open("input.csv")
f = io.StringIO(text)

# search start text
for line in f:
    if start_text in line:
        break
    
# search end text and keep lines in string
for line in f:
    if end_text in line:
        break
    selected += line

#print(selected)

# use string to create `file-like-object and load to DataFrame

file_object = io.StringIO(selected)

df = pd.read_csv(file_object)

print(df)
furas
  • 134,197
  • 12
  • 106
  • 148
  • The problem here is that I do not know which part it is. The original file is quite large, so I cannot count by hand, that the part I want has eg. index 1 here. – qcabepsilon Mar 08 '22 at 14:55
  • how do you recognize which part do you want? You should describe it in question. But now you can use `for`-loop to check every part if it is part which you need. OR you can still use your method and read line by line but keep lines/part which match your selection and load directly from this part instead of load file again. – furas Mar 08 '22 at 14:56
  • I recognize it by the name given in "execution on". – qcabepsilon Mar 08 '22 at 15:13
  • I added example with read line by line – furas Mar 08 '22 at 15:13
0

Not really sure if this is what you are looking for, but you could consider adding a MultiIndex to the dataframe you are storing. So your index levels could look something like ['run_id', 'usual_id']. For each run you would generate an unique index and then when you wish to store your results you append the current run dataframe to the existing csv file following this post. This will provide you pretty convenient way of looking through your runs since you could load them all in a single dataframe during analysis.

Grinjero
  • 436
  • 2
  • 7