3

Very new to pandas so any explanation with a solution is appreciated.

I have a dataframe such as

    Company                             Zip State City
1   *CBRE                               San Diego, CA 92101
4   1908 Brands                         Boulder, CO 80301
7   1st Infantry Division Headquarters  Fort Riley, KS
10  21st Century Healthcare, Inc.       Tempe 85282
15  AAA                                 Jefferson City, MO 65101-9564

I want to split the Zip State city column in my data into 3 different columns. Using the answer from this post Pandas DataFrame, how do i split a column into two I could accomplish this task if I didn't have my first column. Writing a regex to captures all companies just leads to me capturing everything in my data.

I also tried

foo = lambda x: pandas.Series([i for i in reversed(x.split())])
data_pretty = data['Zip State City'].apply(foo)

but this causes me to loose the company column and splits the names of the cities that are more than one word into separate columns.

How can I split my last column while keeping the company column data?

Community
  • 1
  • 1
Jstuff
  • 1,266
  • 2
  • 16
  • 27
  • Its not right to edited the question after its been answered. – Merlin Jul 18 '16 at 17:46
  • @Merlin wrote a comment on MaxU's question and needed to edit the post to show what I was asking. How would you accomplishing this? – Jstuff Jul 18 '16 at 17:47

1 Answers1

8

you can use extract() method:

In [110]: df
Out[110]:
                               Company                 Zip State City
1                                *CBRE            San Diego, CA 92101
4                          1908 Brands              Boulder, CO 80301
7   1st Infantry Division Headquarters                 Fort Riley, KS
10       21st Century Healthcare, Inc.                    Tempe 85282
15                                 AAA  Jefferson City, MO 65101-9564

In [112]: df[['City','State','ZIP']] = df['Zip State City'].str.extract(r'([^,\d]+)?[,]*\s*([A-Z]{2})?\s*([\d\-]{4,11})?', expand=True)

In [113]: df
Out[113]:
                               Company                 Zip State City            City State         ZIP
1                                *CBRE            San Diego, CA 92101       San Diego    CA       92101
4                          1908 Brands              Boulder, CO 80301         Boulder    CO       80301
7   1st Infantry Division Headquarters                 Fort Riley, KS      Fort Riley    KS         NaN
10       21st Century Healthcare, Inc.                    Tempe 85282          Tempe    NaN       85282
15                                 AAA  Jefferson City, MO 65101-9564  Jefferson City    MO  65101-9564

From docs:

Series.str.extract(pat, flags=0, expand=None)

For each subject string in the Series, extract groups from the first match of regular expression pat.

New in version 0.13.0.

Parameters:

pat : string

Regular expression pattern with capturing groups

flags : int, default 0 (no flags)

re module flags, e.g. re.IGNORECASE .. versionadded:: 0.18.0

expand : bool, default False

If True, return DataFrame.

If False, return Series/Index/DataFrame.

Returns: DataFrame with one row for each subject string, and one column for each group. Any capture group names in regular expression pat will be used for column names; otherwise capture group numbers will be used. The dtype of each result column is always object, even when no match is found. If expand=True and pat has only one capture group, then return a Series (if subject is a Series) or Index (if subject is an Index).

Community
  • 1
  • 1
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419