
I want to parse a data column in which each cell is a string listing a hotel's occupied room numbers and the package booked for each room at a given time. Each cell looks like the following:

                          624: COUPLE , 507: DELUXE+ ,301: HONEYMOON 

Here's the code snippet I have written to collect all the occupied room numbers and the packages purchased:

import numpy as np
import pandas as pd
d = np.array(['624: COUPLE , 507: DELUXE+ ,301: HONEYMOON','614:FAMILY , 507: FAMILY+'])
df = pd.Series(d)
df= df.str.extractall(r'(?P<room>[0-9]+)(?P<package>[\S][^,]+)')
df
          

However, the output keeps the colon in front of the package name.

How do I remove the colon in front of the package name in the output?

1 Answer


You can match the colon and any optional whitespace between the two named capturing groups, outside of both, so that neither group captures them:

>>> df.str.extractall(r'(?P<room>[0-9]+):\s*(?P<package>[^\s,]+)')
        room    package
  match                
0 0      624     COUPLE
  1      507    DELUXE+
  2      301  HONEYMOON
1 0      614     FAMILY
  1      507    FAMILY+

Details:

  • (?P<room>[0-9]+) - Group "room": one or more digits
  • :\s* - a colon and then zero or more whitespace characters
  • (?P<package>[^\s,]+) - Group "package": one or more characters other than whitespace or a comma.
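
To see why the original pattern kept the colon, here is a minimal sketch that runs both patterns on one sample cell with the standard `re` module, outside pandas. The helper name `pairs` and the variable names are made up for this illustration. The regular expressions and the sample string come from the question and the answer.

```python
import re

# One sample cell from the question.
cell = "624: COUPLE , 507: DELUXE+ ,301: HONEYMOON"

# Pattern from the question: [\S] accepts the colon as the first
# character of "package", and [^,]+ then runs up to the next comma.
original = re.compile(r"(?P<room>[0-9]+)(?P<package>[\S][^,]+)")

# Pattern from the answer: the colon and optional whitespace are
# consumed between the groups, so "package" starts at the name itself.
fixed = re.compile(r"(?P<room>[0-9]+):\s*(?P<package>[^\s,]+)")


def pairs(pattern, text):
    """Return (room, package) tuples for every match in text (hypothetical helper)."""
    return [(m.group("room"), m.group("package")) for m in pattern.finditer(text)]


original_pairs = pairs(original, cell)
fixed_pairs = pairs(fixed, cell)

print(original_pairs)  # packages still begin with ':'
print(fixed_pairs)     # packages are the bare names
```

Note that [^\s,]+ stops at the first whitespace character, so a package name containing a space would be cut off at that space. The sample data has no such names, so the pattern from the answer is sufficient here.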
Wiktor Stribiżew
  • Now I need to create a dedicated column for each room, where each cell under that column shows whether the room is occupied or available. What would be an efficient way to do so? For example, a single column 'room#507' would be added to the original dataset, with {DELUXE+, FAMILY+, ...} as values. – BN production Feb 20 '21 at 15:59
  • @BNproduction If you have a new question, please consider accepting this one and asking a new question. – Wiktor Stribiżew Feb 20 '21 at 16:00
  • Please help me with this topic: https://stackoverflow.com/questions/66301681/create-pandas-column-using-cell-values-of-another-multi-indexed-data-frame – BN production Feb 21 '21 at 11:15