0

df is unstructured with no columns and rows header. Every columns have strings in which there is a set of pattern which needs to be removed, the pattern is mentioned below as:

Input to one columns of unstructured df as strings:

I am to be read ===start=== I am to be removed ===stop=== I have to be read again ===start=== remove me again ===stop=== continue reading

Ouput needed:

I am to be read I have to be read again continue reading

Here I have to remove from string '===start===' to '===stop===' whenever it occurs. The df has thousands of entries. What is the most efficient way of using regex?

The code below works on a column but takes a long time to complete.

Is there a solution using regex that is most efficient/least time complexity?

df = pd.read_excel("sample_excel.xlsx", header=None)
def removeString(df):
 inf = df[0][1]
 infcopy = ''
bol = False
start = '*start*'
end = '*stop*'
inf.replace('* start *',start) #in case black space between start
inf.replace('* stop *',end)    #in case black space between start
for i in range(len(inf)):
 if inf[i] == "*" and inf[i:i+len(start)] == start:
  bol = True
 if inf[i] == '*' and inf[i+1-len(end):i+1] == end:
  bol = False
continue
 if bol == False:
 infcopy += inf[i]
df[0][1] = infcopy
think-maths
  • 917
  • 2
  • 10
  • 28
  • 1
    Welcome to SO..!! please check [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) – anky Jan 23 '19 at 08:06
  • What does your input data look like? Can you give a sample? – Josh Friedlander Jan 23 '19 at 08:10
  • I/p data could be: – think-maths Jan 23 '19 at 08:32
  • I/p data could be: Read the data from here ===start=== I am to be removed ===stop=== read it again ===start=== i am to be removed again ===stop=== read it again from here o/p should be: Read data from here read it again read it again from here _______________________________________________ ===start=== to ===stop=== should be ignored it could occur multiple times in entire df with huge data. Also it might be blank space in between === start === and === stop === so that should also be considered. Program has to be most efficient – think-maths Jan 23 '19 at 08:44
  • If all you are looking for is to optimize, one quick suggestion is to try the following: (1) Write a function that runs the algorithm and takes just a string as argument, and returns the modified string. (2) Use `df.apply()` to run this function on all cells, which will run optimally due to pandas internal implementation. – crazyGamer Jan 23 '19 at 09:09
  • @SachinKumar can you add a sample data in your question please , thanks – anky Jan 23 '19 at 09:14

1 Answers1

0

I think it could look something like this.

import pandas as pd
import re

def removeString(df):
    pattern = r'(?:start(.*?)stop)'
    df[ColToRemove] = df[ColToRemove].apply(lambda x: re.sub(pattern, "",x))

E.g.

df = pd.DataFrame({'Col1':['startjustsomethingherestop']})

Output:

                         Col1
0  startjustsomethingherestop

And then,

pattern = r'(?:start(.*?)stop)'
df['Col1'] = df['Col1'].apply(lambda x: re.sub(pattern, "", x))

Output:

  Col1
0

The regex pattern defined here will remove everything whenever a match is found for a string that begins with "start" and ends with "stop" and leave you its as the output

kerwei
  • 1,822
  • 1
  • 13
  • 22
  • Thanks but it is working for just one 'start' and 'stop'. It is failing when there are multiple 'start and stop' it takes the first 'start 'and last stop. The frame will will have multiple start and stops – think-maths Jan 23 '19 at 11:54
  • @SachinKumar Hi, in that case, please try the non-greedy quantifier. Replace pattern with ```pattern = r'(?:start(.*?)stop)'```. I think this is what you're looking for – kerwei Jan 23 '19 at 14:40
  • It works thank you...execution time has been very efficient as well..cheers – think-maths Jan 23 '19 at 16:27
  • it works on on the single columns but when i take data frame from excel having multiple rows and running the function on entire columns it is not working. df = pd.read_excel("sample_excel.xlsx",dtype = str, header=None) df1 = pd.DataFrame({'Col1':[df]}) pattern = r'(?:start(.*?)stop)' df1['Col1'] = df1['Col1'].apply(lambda x: re.sub(pattern, "", x)) error: return _compile(pattern, flags).sub(repl, string, count) TypeError: expected string or bytes-like object – think-maths Jan 24 '19 at 08:12
  • @SachinKumar Is there a reason you need to wrap df into _Col1_ of df1? ```pd.read_excel``` already returns you a DataFrame object that you can work directly on. If you really need to wrap it, then I would suggest that you perform the regex substitution on df first before calling ```df1 = pd.DataFrame({'Col1':[df]})``` – kerwei Jan 24 '19 at 08:23
  • the problem is even after using pattern = r'(?:start(.*?)stop)' while having dataframe from pd.excel it is doing the operation of removing initial start and last stop not the consequent start stops. But working on normal string this pattern works fine. This caught me in dilemma as why not the same result in dataframe as well any insight can you provide on this? – think-maths Jan 24 '19 at 08:35
  • Can you test it out on one of the target values in your dataframe here [link](https://regex101.com/r/HKgTPy/2) and verify that it works? – kerwei Jan 24 '19 at 08:39
  • Just to add that in my answer above, 'Col1' is just a reference to the column that you need to perform regex substitution on. You will have to change its label accordingly and if your dataframe does not contain headers, you should be using the column number index – kerwei Jan 24 '19 at 08:41
  • With the link you provided I ran the input: No title start \n Please see additional information stop Initiate\n Primary GU start : Not able to ring stop sequence to be ended It has two matching patterns but it matched with only one i.e (start : Not able to ring stop) not the first one. There are patterns which are either missed for even i structure it it runs from first start to last stop – think-maths Jan 24 '19 at 09:16
  • Copied and pasted your input into the same regex link and returned 2 matches: **Match1 (full)**```start \n Please see additional information stop``` **Match1 (group)** ```\n Please see additional information``` **Match2 (full)** ```start : Not able to ring stop``` **Match2 (group)** ```: Not able to ring``` – kerwei Jan 24 '19 at 09:23
  • Please take '\n' as line change press enter there. In that case it is taking only one match group Match 1 Full match 76-105 start : Not able to ring stop Group 1. 81-101 : Not able to ring – think-maths Jan 24 '19 at 09:31
  • I see. In that case, I think you can try to compile the regex pattern first, with the DOTALL flag. ```rg = re.compile(pattern, re.DOTALL)```. And then, in your dataframe apply, you do ```df[ColToRemove] = df[ColToRemove].apply(lambda x: re.sub(rg, "",x))```. I hope this solves the newline issue, based on the documentation [link](https://docs.python.org/2/library/re.html#re.finditer) – kerwei Jan 24 '19 at 09:45
  • Otherwise, a quick alternative is to replace those newlines with a pipe character or something first before treating it with regex – kerwei Jan 24 '19 at 09:48
  • DOTALL works for most of the cases except a few in which there are spaces between start and stop and some extra characters gets snipped if conditioning the pattern. Will add some more in the pattern you have provided to avoid some odd niggles. Thanks a lot (y) – think-maths Jan 24 '19 at 11:03