How to create separate substring column from dataframe column value

Question

I have dataframe like:

Instru,Name
16834306,INFOSYS18SEP640.50PE
16834306,INFOSYS18SEP640.50PE
16834306,BHEL18SEP52.80CE
16834306,BHEL18SEP52.80CE
16834306,IOCL18SEP640PE
16834306,IOCL18SEP640PE

I want create separate column by taking string from Name Column as below:

Instru,Name,Symbol,Month,SP,Type
16834306,INFOSYS18SEP640.50PE,INFOSYS,18SEP,640.50,PE
16834306,INFOSYS18SEP640.50PE,INFOSYS,18SEP,640.50,PE
16834306,BHEL18SEP52.80CE,BHEL,18SEP,52.80,CE    
16834306,BHEL18SEP52.80CE,BHEL,18SEP,52.80,CE
16834306,IOCL18SEP640PE,IOCL,18SEP,640,PE
16834306,IOCL18SEP640PE,IOCL,18SEP,640,PE

Note: Decimal to appear as decimal and int as int for SP Column

is the structure always the same? 4 characters,5 characters, then a flota and then the last two characters? — Yuca, Sep 07 '18 at 13:53
No Sir, it may vary as per the company symbol. String before '18SEP' and after may be considered. — Pravat, Sep 07 '18 at 13:54
If you have a regex for it see answer here - https://stackoverflow.com/questions/46928636/pandas-split-list-into-columns-with-regex — Tom Ron, Sep 07 '18 at 14:03
Possible duplicate of [Pandas extract numbers from column into new columns](https://stackoverflow.com/questions/52171736/pandas-extract-numbers-from-column-into-new-columns) — Lev Zakharov, Sep 07 '18 at 14:04

piRSquared · Accepted Answer · 2018-09-07T14:37:13.347

Using pandas.Series.str.extract with named groups in regex pattern

pat = '(?P<Symbol>.*?)(?P<Month>\d{1,2}\w{3})(?P<SP>[\d\.]+)(?P<Type>.*)'
df.join(df.Name.str.extract(pat))

     Instru                  Name   Symbol  Month      SP Type
0  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.50   PE
1  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.50   PE
2  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.80   CE
3  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.80   CE
4  16834306        IOCL18SEP640PE     IOCL  18SEP     640   PE
5  16834306        IOCL18SEP640PE     IOCL  18SEP     640   PE

Explanation of the regex patter

regex is a funny fuzzy business and is an art form. I'll explain what I did and why. You can compare what I did relative to @jonclements and see that we both attacked the problem with the same approach but made subtly different assumptions.

'(?P<group_name>pattern)' Is a way to create a capture group and name it with 'group_name'
'(?P<Symbol>.*?)' Grabs all characters up to the next capture group, the '?' says don't be greedy about it.
'(?P<Month>\d{1,2}\w{3})' Grabs 1 or 2 digits then 3 letters. The vagueness of 1 or 2 digits is why I made the prior group non-greedy.
'(?P<SP>[\d\.]+)' Grab one or more digits or periods. Admittedly, this isn't terribly graceful as it could grab '4.2.4.5' but it should get the job done.
'(?P<Type>.*)' Plays clean up and grabs the rest.

Great answer. It could be useful if you explained the regex patters. — Alexander, Sep 07 '18 at 14:13

Jon Clements · Answer 2 · 2018-09-07T14:16:23.590

You can use str.extract and apply .astype to the result to get your desired columns and your specific numeric column as a float:

separated = df.Name.str.extract(r"""(?ix)
    (?P<Symbol>[a-z]+)     # all letters up to a date that matches
    (?P<Month>\d{2}\w{3})  # the date (2 numbers then 3 letters)
    (?P<SP>.*?)            # everything until the "type"
    (?P<Type>\w{2}$)       # Last two characters of string is the type
""").astype({'SP': 'float'})

Which'll give you:

    Symbol  Month     SP Type
0  INFOSYS  18SEP  640.5   PE
1  INFOSYS  18SEP  640.5   PE
2     BHEL  18SEP   52.8   CE
3     BHEL  18SEP   52.8   CE
4     IOCL  18SEP  640.0   PE
5     IOCL  18SEP  640.0   PE

Then apply df.join(separated) to get your final DF of:

     Instru                  Name   Symbol  Month     SP Type
0  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.5   PE
1  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.5   PE
2  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.8   CE
3  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.8   CE
4  16834306        IOCL18SEP640PE     IOCL  18SEP  640.0   PE
5  16834306        IOCL18SEP640PE     IOCL  18SEP  640.0   PE

I like your explanation of the regex pattern. I'll make the same effort and contrast with yours. — piRSquared, Sep 07 '18 at 14:11

score 2 · Answer 3 · answered Sep 07 '18 at 14:07

You can define your splitting function and create the desired output

def f(x):
    for i, c in enumerate(x):
        if c.isdigit():        
            break
    return [x[0:i], x[i:9], x[9:-2], x[-2:]]

df[['Symbol','Month','SP','Type']] = pd.DataFrame(df.Name.apply(f).tolist())

     Instru               Name Symbol  Month      SP Type
0  16834306  INFY18SEP640.50PE   INFY  18SEP  640.50   PE
1  16834306  INFY18SEP640.50PE   INFY  18SEP  640.50   PE
2  16834306   BHEL18SEP52.80CE   BHEL  18SEP   52.80   CE
3  16834306   BHEL18SEP52.80CE   BHEL  18SEP   52.80   CE
4  16834306     IOCL18SEP640PE   IOCL  18SEP     640   PE
5  16834306     IOCL18SEP640PE   IOCL  18SEP     640   PE

How to create separate substring column from dataframe column value

3 Answers3

Explanation of the regex patter