Get the position of a substring in a column of DataFrame using regex

Question

I'd like to break up a string into Pandas DataFrame columns using a regex.

Sample csv data [Updated]:

Data;Code;Temp;....
12 364 OPR 4 67474;;33;...
893 73 GDP hdj 747;;34;...
hr 777 hr9 GDP;;30;...
463 7g 448 OPR;;28;...

Desired situation: [Updated]

Data           | Code        | Temp     | ...
------------------------------------------------
12 364         | OPR 4 67474 | 33       | ...
893 73         | GDP hdj 747 | 34       | ...
hr 777 hr9 GDP | NaN         | 30       | ...
463 7g 448 OPR | NaN         | 28       | ...

regex:

code = re.compile('\sOPR.?[^$]|\sGDP.?[^$]')

I only need to split if OPR or GDP is not at the end of the string. I was looking for a way to split based on the match position. Something like: match.start())
I tried something like: df['data'].str.contains(code, regex=True) and df['data'] = df['data'].str.extract(code, expand=True) and str.find only seems to work with a string and not with re.Pattern. I don't get it done.

I'm pretty new with Pandas, so please bear with me.

Do you wan to split 1 column of Pandas Dataframe into 2 columns? — Rusty, Dec 15 '18 at 11:30
`NaN` in 2nd column and `hr 777 hr9 GDP` in 1st is desired? I don't get the rule. — Zydnar, Dec 15 '18 at 11:31
In regex you can use quantity of specific match eg: `\d{3}` Match groups can also be helpful. — Zydnar, Dec 15 '18 at 11:38
@Rusty: Yes that's what i want. Please see desired situation — John Doe, Dec 15 '18 at 11:54
@Zydnar: I only need to split if `OPR` or `GDP` is not at the end of the string. Thats the case for 0 and 1, For 2 and 3 it's at the and and there's no need to split. — John Doe, Dec 15 '18 at 11:56
@Zydnar: How can i use this with Pandas DataFrames ? The sample is only 4 lines, but the actual data is in a large DataFrame — John Doe, Dec 15 '18 at 11:57

Chris Doyle · Accepted Answer · 2018-12-15T13:58:07.667

2

I am fairly new to python so someone might be able to comment if this is not a good approach. My line of thinking was to take the input and process it line by line. drop the trailing semi colon as you dont have it in your output. then using regex split the line by a space char only if its followed by either OPR or GDP and which is not at the end of the line. If this gives only one item in the list, then append the list with NaN to fill the second column. then i have printed with formatting.

import re
data_string="""12 364 OPR 4 67474;
893 73 GDP hdj 747;
hr 777 hr9 GDP;
463 7g 448 OPR;
"""

data_list=data_string.splitlines()
for data in data_list:
    data_split=re.split("\s(?=(?:GDP|OPR)[^$])",data[:-1])
    if len(data_split)==1: data_split.append("NaN")
    print("%-20s|%-20s" % tuple(data_split))

OUTPUT

12 364              |OPR 4 67474         
893 73              |GDP hdj 747         
hr 777 hr9 GDP      |NaN                 
463 7g 448 OPR      |NaN

Updated in light of question edit and comments

Based on your update to the question and comments you could try the below. I would suggest you to test this and check for any edge cases or add validation or conditional checks before performing updates.

import pandas as pd
import re

source_data = {'data': ['12 364 OPR 4 67474', '893 73 GDP hdj 747', 'hr 777 hr9 GDP','463 7g 448 OPR'],
         'code': [None, None, None, None],
         'Temp': [33,34,30,28]
         }
df = pd.DataFrame.from_dict(source_data)
print("Original df:")
print(df, "\n")

row_iter=df.iterrows()
for index,row in row_iter:
    data=df.at[index,'data']
    data_split=re.split("\s(?=(?:GDP|OPR)[^$])",data)
    if len(data_split)==2:
        df.at[index,'data']=data_split[0]
        df.at[index,'code']=data_split[1]

print("Updated df:")
print(df)

OUTPUT

Original df:
                 data  code  Temp
0  12 364 OPR 4 67474  None    33
1  893 73 GDP hdj 747  None    34
2      hr 777 hr9 GDP  None    30
3      463 7g 448 OPR  None    28 

Updated df:
             data         code  Temp
0          12 364  OPR 4 67474    33
1          893 73  GDP hdj 747    34
2  hr 777 hr9 GDP         None    30
3  463 7g 448 OPR         None    28

edited Dec 15 '18 at 13:58

answered Dec 15 '18 at 12:08

Chris Doyle

10,703
2
23
42

Sorry for the misunderstanding but there are more columns. I just created a simple example. I read the csv using pd.read_cvs(fileName, delimiter=';') – John Doe Dec 15 '18 at 12:20
Doesnt the 2 answers you already have give you enough of an example or idea of how to tackle your problem such that you could have closed the question and gone away to work on a solution? – Chris Doyle Dec 15 '18 at 12:31
I understand the samples, but not how to implement this using Pandas DataFrames instead of a list. I probably do not see the connection, sorry – John Doe Dec 15 '18 at 12:37
your question was how to break up a string using regex. thats what both answers do. if you now have a question of how to turn a list in to a dataframe then thats a different questions and should be asked as a new question. However have you tried having a look here: `http://pbpython.com/pandas-list-dict.html` – Chris Doyle Dec 15 '18 at 12:41
I think it's clear that it's related to Pandas and DataFrames as mentiontioned in my question (and the Tags). I also posted some Panda related code i've already tried. I'm not trying to read data from a list or dict to a DataFrame. The data is imported using Pandas.read_csv. So I already have a DataFrame and want to break up the string from one column to two columns. – John Doe Dec 15 '18 at 12:50
have you looked at `https://stackoverflow.com/questions/14745022/how-to-split-a-column-into-two-columns` – Chris Doyle Dec 15 '18 at 13:02
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/185295/discussion-between-chris-doyle-and-john-doe). – Chris Doyle Dec 15 '18 at 13:12
Yes. The problem is that I need to know the position. So I was thinking to use a regex to find the match position (like `match.end()` or `match.start()` - that's what I usually use for strings) and use that for the split. But I don't know how to use regex with a Pandas column value. Maybe this was a better title for my question: 'Get the index or position of a substring in a column of A Python dataframe – Pandas' – John Doe Dec 15 '18 at 13:14
Many thanks for the update. I also tried to to get `OPR` or `GDP` (pattern) included to the data column instead of the code column. I tried a lookbehind but received an error `look-behind requires fixed-width pattern`. Is this easy to fix? I want to learn more about the way you use the regex. – John Doe Dec 15 '18 at 15:24

score 0 · Answer 2 · answered Dec 15 '18 at 11:26

So first u gotta check if the data has GDP or OPR at the end. If not then u can used grouped regex to get the desired items. Here stuff in rounded brackets () represents a group. I have named them using syntax ?P that's optinal.

import re

data = ["12 364 OPR 4 67474;",
   "893 73 GDP hdj 747;",
   "hr 777 hr9 GDP;",
   "463 7g 448 OPR;"]

for item in data:
# first check if it ends with GPR; or OPR;
if re.search("GDP;|OPR;$", item):
    # as u specified it needs to be ignored
    print(item)
else:
    # now u can split into two parts - i am splitting in three but u can do use them however u like
    splited_match_obj = re.search("(?P<Data>.+)(?P<Value>OPR|GDP)(?P<Code>.+)", item)
    print(splited_match_obj["Data"], splited_match_obj["Value"], splited_match_obj["Code"] )

How can i use this with Pandas DataFrames ? The sample is only 4 lines, but the actual data is in a large DataFrame imported from a csv file — John Doe, Dec 15 '18 at 12:00

Vaishali · Answer 3 · 2018-12-15T14:48:12.700

Lets say this is your dataframe,

    Data                Temp
0   12 364 OPR 4 67474  33
1   893 73 GDP hdj 747  34
2   hr 777 hr9 GDP      30
3   463 7g 448 OPR      28

You can use extract with multiple capture groups based on condition

df1[['Data', 'Code']] = df.loc[~df['Data'].str.endswith(('OPR','GDP')), 'Data'].str.extract('(.*)([A-Z]{3} .*)')
df2[['Data', 'Code']] = df.loc[df['Data'].str.endswith(('OPR','GDP')), 'Data'].str.extract('(.*[OPR|GDP]$)(.*)')
df[['Data', 'Code']] = pd.concat([df1,df2])

    Data            Temp    Code
0   12 364          33      OPR 4 67474
1   893 73          34      GDP hdj 747
2   hr 777 hr9 GDP  30  
3   463 7g 448 OPR  28

Get the position of a substring in a column of DataFrame using regex

3 Answers3