My df is unstructured, with no column or row headers. Every column holds strings that contain a recurring pattern which needs to be removed. The pattern is shown below.
Input (one cell of the unstructured df, as a string):
I am to be read ===start=== I am to be removed ===stop=== I have to be read again ===start=== remove me again ===stop=== continue reading
Output needed:
I am to be read I have to be read again continue reading
Here I have to remove everything from '===start===' to '===stop===', markers included, wherever it occurs. The df has thousands of entries. What is the most efficient way to do this using regex?
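A minimal sketch of the removal on a single string, assuming the markers are the literal '===start===' and '===stop===', optionally with spaces inside them. The names MARKED_BLOCK and strip_marked are hypothetical, introduced only for illustration:

```python
import re

# Non-greedy match from a start marker to the nearest stop marker.
# The leading \s* also consumes the whitespace before the block, so the
# surrounding text does not end up with a double space.
# \s* inside the markers tolerates variants such as "=== start ===".
# re.DOTALL lets a marked block span line breaks.
MARKED_BLOCK = re.compile(r"\s*===\s*start\s*===.*?===\s*stop\s*===", re.DOTALL)

def strip_marked(text):
    """Remove every start...stop block from a single string."""
    return MARKED_BLOCK.sub("", text)

sample = ("I am to be read ===start=== I am to be removed ===stop=== "
          "I have to be read again ===start=== remove me again ===stop=== "
          "continue reading")
print(strip_marked(sample))
```

The non-greedy `.*?` matters here: a greedy `.*` would run from the first start marker to the last stop marker and delete the text in between that should be kept. A block at the very beginning of a string leaves one leading space behind, which can be stripped separately if that is a concern.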
The code below works on one cell of a column but takes a long time to complete.
Is there a regex-based solution that is more efficient (lower time complexity)?
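import pandas as pd

df = pd.read_excel("sample_excel.xlsx", header=None)

def removeString(df):
    inf = df[0][1]  # a single cell: column 0, row 1
    infcopy = ''
    bol = False  # True while inside a start...stop block
    start = '===start==='
    end = '===stop==='
    # str.replace returns a new string, so the result must be assigned
    inf = inf.replace('=== start ===', start)  # in case of blank space around start
    inf = inf.replace('=== stop ===', end)  # in case of blank space around stop
    for i in range(len(inf)):
        # entering a block: a start marker begins at position i
        if inf[i] == '=' and inf[i:i+len(start)] == start:
            bol = True
        # leaving a block: a stop marker ends at position i
        if bol and inf[i] == '=' and inf[i+1-len(end):i+1] == end:
            bol = False
            continue
        if not bol:
            infcopy += inf[i]
    df.at[1, 0] = infcopy  # write back without chained assignment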
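One possible way to apply the same idea to the whole df at once is pandas' DataFrame.replace with regex=True, which substitutes inside string cells and leaves other values alone. This is a sketch, not a benchmarked answer: the function name remove_marked_blocks and the small in-memory frame are hypothetical stand-ins for the Excel data, and the pattern assumes the '===start===' / '===stop===' markers from the example.

```python
import pandas as pd

# (?s) makes "." match newlines; .*? stops at the nearest stop marker.
# \s* tolerates optional spaces inside the markers and eats the
# whitespace directly before each block.
PATTERN = r"(?s)\s*===\s*start\s*===.*?===\s*stop\s*==="

def remove_marked_blocks(df):
    """Return a new DataFrame with start...stop blocks removed from every string cell."""
    return df.replace(to_replace=PATTERN, value="", regex=True)

# Hypothetical stand-in for pd.read_excel("sample_excel.xlsx", header=None)
df = pd.DataFrame([
    ["keep ===start=== drop ===stop=== keep", "no markers here"],
    ["a === start === b === stop === c", 42],
])
cleaned = remove_marked_blocks(df)
print(cleaned)
```

Compared with the character-by-character loop, this hands the scanning to the regex engine and avoids building each result with repeated string concatenation, so it is typically much faster on thousands of cells. Timing it on the real data is still the only way to confirm the gain.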