Extract integers after double space with regex

Question

I have a dataframe from which I want to extract the integers that follow a double space. In every row of column NAME, a double space separates the company name from the integers.

                                    NAME  INVESTMENT  PERCENT
0     APPLE COMPANY A  57 638 232 stocks     OIL LTD  0.12322
1  BANANA 1 COMPANY B  12 946 201 stocks    GOLD LTD  0.02768
2     ORANGE COMPANY C  8 354 229 stocks     GAS LTD  0.01786

import pandas as pd

df = pd.DataFrame({
    'NAME': ['APPLE COMPANY A  57 638 232 stocks', 'BANANA 1 COMPANY B  12 946 201 stocks', 'ORANGE COMPANY C  8 354 229 stocks'],
    'PERCENT': [0.12322, 0.02768, 0.01786]
})

I tried this earlier, but it also picks up the integers in the company name:

df['STOCKS']=df['NAME'].str.findall(r'\b\d+\b').apply(lambda x: ''.join(x))

Instead, I tried to split on the double space:

df['NAME'].str.split(r'(\s{2})')

which gives output:

0       [APPLE COMPANY A,   , 57 638 232 stocks]
1    [BANANA 1 COMPANY B,   , 12 946 201 stocks]
2       [ORANGE COMPANY C,   , 8 354 229 stocks]

However, I want the integers that occur after double spaces to be joined/merged and put into a new column.

                 NAME  PERCENT  STOCKS
0     APPLE COMPANY A  0.12322  57638232
1  BANANA 1 COMPANY B  0.02768  12946201
2    ORANGE COMPANY C  0.01786   8354229

How can I modify my second function to do what I want?

Why do you want to use regex? It's slow, and you can just split on two spaces (`.split('  ')`). Also, I assume it would work if you removed your first column and created a new one holding the content. – Nenri Mar 15 '19 at 07:56

Wiktor Stribiżew · Accepted Answer · 2019-03-15T09:03:22.813

4

Following the original logic you may use

df['STOCKS'] = df['NAME'].str.extract(r'\s{2,}(\d+(?:\s\d+)*)', expand=False).str.replace(r'\s+', '', regex=True)
df['NAME'] = df['NAME'].str.replace(r'\s{2,}\d+(?:\s\d+)*\s+stocks', '', regex=True)

Output:

                 NAME  PERCENT    STOCKS
0     APPLE COMPANY A  0.12322  57638232
1  BANANA 1 COMPANY B  0.02768  12946201
2    ORANGE COMPANY C  0.01786   8354229

Details

`\s{2,}(\d+(?:\s\d+)*)` extracts the first run of whitespace-separated digit chunks that follows two or more whitespace characters; the chained `.str.replace` with `r'\s+'` then removes the whitespace inside the extracted text.
The second `.str.replace`, with `r'\s{2,}\d+(?:\s\d+)*\s+stocks'`, updates the NAME column: it removes two or more whitespace characters, the whitespace-separated digit chunks, one or more whitespace characters, and the word stocks. If other words may follow the number, the trailing `\s+stocks` can be replaced with `.*`.
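To see what the two patterns match without depending on a particular pandas version, here is a minimal sketch using only the standard `re` module on the sample strings from the question. The helper name `split_name` is made up for illustration and is not part of the answer.

```python
import re

# Sample values copied from the question's NAME column.
names = [
    "APPLE COMPANY A  57 638 232 stocks",
    "BANANA 1 COMPANY B  12 946 201 stocks",
    "ORANGE COMPANY C  8 354 229 stocks",
]

# Two or more whitespace characters, then digit chunks separated by one whitespace.
STOCKS_RE = re.compile(r"\s{2,}(\d+(?:\s\d+)*)")
# The same run plus the trailing " stocks", used to strip the suffix from the name.
SUFFIX_RE = re.compile(r"\s{2,}\d+(?:\s\d+)*\s+stocks")


def split_name(value):
    """Return (company_name, stocks_digits) for one NAME value (hypothetical helper)."""
    match = STOCKS_RE.search(value)
    # Drop the whitespace between digit chunks; None if the pattern is absent.
    stocks = re.sub(r"\s+", "", match.group(1)) if match else None
    return SUFFIX_RE.sub("", value), stocks


results = [split_name(value) for value in names]
for company, stocks in results:
    print(company, stocks)
```

The lone `1` in `BANANA 1 COMPANY B` is left alone because it is preceded by a single space, so `\s{2,}` cannot match there.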

edited Mar 15 '19 at 09:03

answered Mar 15 '19 at 08:05


regex should not be the best solution, but this works fine so, up for this – Nenri Mar 15 '19 at 08:10
I get the error message: 'DataFrame' object has no attribute 'str' for `df['NAME'].str.extract(r'\s{2,}(\d+(?:\s\d+)*)').str.replace('\s+', '')` – Mataunited18 Mar 15 '19 at 08:13
@Mataunited17 I used the data from your question and it works in my Python 3.6. – Wiktor Stribiżew Mar 15 '19 at 08:14
I see. I am using 3.7.1. Strange. – Mataunited18 Mar 15 '19 at 08:15
Is there anything I can do to make it compatible with my version? – Mataunited18 Mar 15 '19 at 08:25
@Mataunited17 Added `expand=False` so that `extract` only returned the series and tested in Python 3.7. Also works in Python 3.6. – Wiktor Stribiżew Mar 15 '19 at 09:06
@WiktorStribiżew Yes. It worked. Thanks. I think I will accept your answer, it explained more than the other answer. – Mataunited18 Mar 15 '19 at 13:07

Chris Adams · Answer 2 · 2019-03-15T09:05:36.763

3

Another pandas approach, which will cast STOCKS to numeric type:

df_split = (df['NAME'].str.extractall(r'^(?P<NAME>.+)\s{2}(?P<STOCKS>[\d\s]+)')
            .reset_index(level=1, drop=True))

df_split['STOCKS'] = pd.to_numeric(df_split['STOCKS'].str.replace(r'\D', '', regex=True))

Assign these columns back into your original DataFrame:

df[['NAME', 'STOCKS']] = df_split[['NAME', 'STOCKS']]

                 NAME  PERCENT    STOCKS
0     APPLE COMPANY A  0.12322  57638232
1  BANANA 1 COMPANY B  0.02768  12946201
2    ORANGE COMPANY C  0.01786   8354229
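The named-group pattern can also be checked on a single string with the standard `re` module. This is only a sketch of what each group captures; `parse_row` is a hypothetical helper, and it assumes the STOCKS group always contains at least one digit.

```python
import re

# Same pattern as in the answer: NAME is everything before the double
# whitespace, STOCKS is the following run of digits and whitespace.
ROW_RE = re.compile(r"^(?P<NAME>.+)\s{2}(?P<STOCKS>[\d\s]+)")


def parse_row(value):
    """Return {'NAME': str, 'STOCKS': int}, or None if the pattern does not match."""
    match = ROW_RE.match(value)
    if match is None:
        return None
    # Strip every non-digit character, then convert to an integer.
    stocks = int(re.sub(r"\D", "", match.group("STOCKS")))
    return {"NAME": match.group("NAME"), "STOCKS": stocks}


row = parse_row("BANANA 1 COMPANY B  12 946 201 stocks")
print(row)
```

Because the digits are converted with `int`, the result is numeric, which mirrors what `pd.to_numeric` does for the whole column.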

edited Mar 15 '19 at 09:05

answered Mar 15 '19 at 08:13


This solution was the best so far. However, does the solution replace any other column that exists from before? My original dataframe has 3 columns. – Mataunited18 Mar 15 '19 at 08:59
It won't replace anything; it creates a new `DataFrame`. I can update my answer to assign back into the original df if that's preferred? – Chris Adams Mar 15 '19 at 09:01
Thanks for the edit! This was exactly what I was looking for. – Mataunited18 Mar 15 '19 at 09:25

Justice_Lords · Answer 3 · 2019-03-15T08:08:56.057

1

You can use look behind and look ahead operators.

''.join(re.findall(r'(?<=\s{2})(.*)(?=stocks)',string)).replace(' ','')

This captures everything between the double space and the word stocks, then removes all the spaces.

Another solution, using find and slicing

df["NAME"].apply(lambda x:x[x.find('  ')+2:x.find('stocks')-1].replace(' ',''))
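One thing to watch with the slicing version: `str.find` returns -1 when the substring is absent, so a row without a double space or without the word stocks is sliced at the wrong positions instead of failing. Below is a minimal sketch of a more defensive variant. The function name and the choice to split on the last double space are assumptions for illustration, not something confirmed in this thread.

```python
def stocks_after_double_space(value, marker="stocks"):
    """Return the digits after the last double space, or None if there is none.

    Hypothetical helper: assumes the number sits between the last double
    space and the marker word.
    """
    # rpartition splits on the LAST double space, so an earlier double
    # space inside the company name does not shift the result.
    head, sep, tail = value.rpartition("  ")
    if not sep:
        return None
    # Keep the text before the marker, then drop all remaining whitespace.
    digits = tail.split(marker)[0]
    return "".join(digits.split())


samples = [
    "APPLE COMPANY A  57 638 232 stocks",
    "BANANA 1 COMPANY B  12 946 201 stocks",
    "ORANGE COMPANY C  8 354 229 stocks",
]
extracted = [stocks_after_double_space(s) for s in samples]
print(extracted)
```

With pandas it could be applied as `df['NAME'].apply(stocks_after_double_space)`.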

Reference:-

Look_behind

edited Mar 15 '19 at 08:08

answered Mar 15 '19 at 07:59


Or he could just do `.split('  ')[1].split()[0]` which is way faster than regex (2 spaces in the first split) – Nenri Mar 15 '19 at 08:01
@Mataunited17 can you show me what you tried to do ? that should work just fine – Nenri Mar 15 '19 at 08:05
@Nenri I did `df['NAME'].str.split(' ')[1].split()[0]` which gave me error: 'list' object has no attribute 'split'. Which is strange, because I have a dataframe. – Mataunited18 Mar 15 '19 at 08:07
yeah, and `.str` is supposed to return you a string – Nenri Mar 15 '19 at 08:09
@Justice_Lords When I applied your second solution to my original dataframe, the outcome is strange when the names are very long. Is there a way to fix this? I think it has to do with the `+2` part of `x:x[x.find(' ')+2:x.find('stocks')` – Mataunited18 Mar 15 '19 at 08:37

Vaghinak · Answer 4 · 2019-03-15T08:03:52.620

0

You can try

df['STOCKS'] = df['NAME'].str.split(',')[2].replace(' ', '')
df['NAME'] = df['NAME'].str.split(',')[0]

edited Mar 15 '19 at 08:03

answered Mar 15 '19 at 07:58


Thx, but still, there's no comma in his string, you should split on spaces and as one has 2 spaces, it should be `.split()[3].split()[0]` – Nenri Mar 15 '19 at 08:04
Oh sorry I forgot to change it – Vaghinak Mar 15 '19 at 08:04
@Vaghinak this didn't work either. I get the error message: 'list' object has no attribute 'replace' – Mataunited18 Mar 15 '19 at 08:05
@Mataunited17 'cause you need to split on spaces and yeah this answer is false, he forgot a lot of things – Nenri Mar 15 '19 at 08:06

Loochie · Answer 5 · 2019-03-15T09:40:25.867

0

This can be done without using regex by using split.

df['STOCKS'] = df['NAME'].apply(lambda x: ''.join(x.split('  ')[1].split(' ')[:-1]))
df['NAME'] = df['NAME'].str.replace(r'\s+\d+(?:\s\d+).*', '', regex=True)

edited Mar 15 '19 at 09:40

answered Mar 15 '19 at 09:23

