363

I have a data frame with one (string) column and I'd like to split it into two (string) columns, with one column header as 'fips' and the other 'row'

My dataframe df looks like this:

          row
0    00000 UNITED STATES
1    01000 ALABAMA
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL

I do not know how to use df.row.str[:] to achieve my goal of splitting the row cell. I can use df['fips'] = hello to add a new column and populate it with hello. Any ideas? I'd like to end up with something like this:

         fips       row
0    00000 UNITED STATES
1    01000 ALABAMA 
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL
smci
a k
  • how did you load your data into pandas? You might be able to load the data in your desired format using `read_table()` or `read_fwf()` – zach Feb 07 '13 at 23:29
  • *"How to split a column"* has different answers depending on whether the column is string, list, or something else, also what format (e.g. 'formatted string' like an address, for which you might need to use a regex. Here you have a string column with fixed-width format ("ZZZZZ placename...") so we know the zipcode is characters 0:4 and the placename is characters 6: – smci Oct 29 '20 at 20:09

11 Answers

767

TL;DR version:

For the simple case of:

  • I have a text column with a delimiter and I want two columns

The simplest solution is:

df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)

You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.

Notice how, in either case, the .tolist() method is not necessary. Neither is zip().
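
Applied to the question's own data, the TL;DR would look like this (a sketch; the column is named 'row' as in the question, and current pandas wants the keyword form n=1):

import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA']})
# 'fips' is appended as a new column; 'row' is overwritten with the name part
df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)
print(df)
#              row   fips
# 0  UNITED STATES  00000
# 1        ALABAMA  01000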

In detail:

Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method.

But for a simple split over a known separator (like splitting by dashes, or splitting by whitespace), the .str.split() method is enough [1]. It operates on a column (Series) of strings, and returns a column (Series) of lists:

>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df

      AB
0  A1-B1
1  A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df

      AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]

[1]: If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.

But how do you go from:

  • a column containing two-element lists

to:

  • two columns, each containing the respective element of the lists?

Well, we need to take a closer look at the .str attribute of a column.

It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method to each element as efficiently as possible:

>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df

   U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df

   U  L
0  A  a
1  B  b
2  C  c

But it also has an "indexing" interface for getting each element of a string by its index:

>>> df['AB'].str[0]

0    A
1    A
Name: AB, dtype: object

>>> df['AB'].str[1]

0    1
1    2
Name: AB, dtype: object

Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:

>>> df['AB'].str.split('-', n=1).str[0]

0    A1
1    A2
Name: AB, dtype: object

>>> df['AB'].str.split('-', n=1).str[1]

0    B1
1    B2
Name: AB, dtype: object

Then, it's a simple matter of taking advantage of the Python tuple unpacking of iterables to do

>>> df['A'], df['B'] = df['AB'].str.split('-', n=1).str
>>> df

      AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2

Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:

>>> df['AB'].str.split('-', n=1, expand=True)

    0   1
0  A1  B1
1  A2  B2

So, another way of accomplishing what we wanted is to do:

>>> df = df[['AB']]
>>> df

      AB
0  A1-B1
1  A2-B2

>>> df.join(df['AB'].str.split('-', n=1, expand=True).rename(columns={0:'A', 1:'B'}))

      AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2

The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:

>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
         AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
  [...]    
ValueError: Length of values does not match length of index
>>> 

But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":

>>> df.join(
...     df['AB'].str.split('-', expand=True).rename(
...         columns={0:'A', 1:'B', 2:'C'}
...     )
... )
         AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3
cs95
LeoRochael
  • 1
    df['A'], df['B'] = df['AB'].str.split(' ', 1).str What is the meaning of '1' in split(' ', 1) ? – Hariprasad Mar 30 '17 at 13:12
  • 1
    @Hariprasad, it's the maximum number of splits. I've added a link to the [docs for the Python Version of the `.split()` method](https://docs.python.org/3.6/library/stdtypes.html#str.split) which explain the first two parameters better than the Pandas docs. – LeoRochael Mar 30 '17 at 14:49
  • 9
    pandas 1.0.0 reports "FutureWarning: Columnar iteration over characters will be deprecated in future releases." – Frank Feb 05 '20 at 19:17
  • 3
    This works under pandas 1.0.1. `df.join(df['AB'].str.split('-', 1, expand=True).rename(columns={0:'A', 1:'B'}))` – Martien Lubberink Mar 07 '20 at 00:32
  • @MartienLubberink Instead of that you can assign to a DataFrame sub-slice: `df[['A', 'B']] = df['AB'].str.split('-', n=1, expand=True)` – cs95 Oct 29 '20 at 05:45
  • This is great, but how do you split when you have two different split areas? Long story short: I'm reading in SQL and want to take the FROM statement and split it into database, schema, table, alias. Database, schema, and table are easy - I split at the period/dot. But, alias? Sometimes there could be just a space and sometimes people would put in AS. What would you suggest? – DataGirl Jun 29 '21 at 13:21
  • 1
    @DataGirl, for that I'd suggest [`Series.str.extract`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html) with an appropriately written regex. – LeoRochael Jul 01 '21 at 23:12
  • If I use this I get `ValueError: Columns must be same length as key` – John Vandivier Oct 14 '21 at 23:25
  • 1
    @JohnVandivier, It's possible none of the values in your split column achieve the number of splits you expect, so the split operation is not producing the number of columns you're trying to assign. Check [this pandas bug](https://github.com/pandas-dev/pandas/issues/35807) and perhaps [this other S.O. question](https://stackoverflow.com/questions/46585193/pandas-error-in-python-columns-must-be-same-length-as-key). I'll try to think of ways to expand this answer to overcome this issue, or at least warn about it. – LeoRochael Oct 23 '21 at 01:48
  • Hello, thanks for this. How can I split an object such as 324786ABCD or -577.89CBC into two columns, one with text, one with the numbers? – Baobab Mar 17 '22 at 16:05
  • 1
    @Baobab, I think your comment should be its own Stack Overflow question, since this question is about splitting on separators. But on a hunch I'd suggest you try the [`.str.extract()`](https://stackoverflow.com/a/21296915/1273938) method with a regex like `r'(?P<number>[+-.\d]+)(?P<text>[a-zA-Z]+)'`. – LeoRochael Mar 17 '22 at 18:07
  • @Hariprasad Regarding your question: sample[meta_desc_as_column_name] = sample['comb'].str.split('|', 5, expand=True) – Vinay Chaudhari May 20 '22 at 11:01
  • @Hariprasad Let's take an example: sample[['X','Y','Z']] = sample['pattern'].str.split('|', 2, expand=True). Say the pattern column holds a string like "A|B|C" and you want to create new columns X, Y, Z in the same data frame from the split values. The number is the maximum number of splits, so 2 breaks the string into the three values A, B and C; with 1 instead, the output would be A and B|C, i.e. a total of 2 values. – Vinay Chaudhari May 20 '22 at 11:08
  • With Pandas 2.0.0 I must use this syntax or else I get an error: `df['AB'].str.split(pat=' ', n=1, expand=True)` – Florin Andrei Apr 26 '23 at 21:54
  • If there are many columns and you don't know how many, you can just leave n unspecified. – Songhua Hu Jun 29 '23 at 22:42
  • @SonghuaHu, right. If one doesn't specify `n`, they must take care that the dataframe resulting from `str.split(..., expand=True)` will have an arbitrary number of columns, and deal with it appropriately. – LeoRochael Jun 30 '23 at 07:41
184

There might be a better way, but here's one approach:

                            row
    0       00000 UNITED STATES
    1             01000 ALABAMA
    2  01001 Autauga County, AL
    3  01003 Baldwin County, AL
    4  01005 Barbour County, AL
df = pd.DataFrame(df.row.str.split(' ', n=1).tolist(),
                  columns=['fips', 'row'])
   fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL
Pranay Aryal
root
  • 12
    Be aware that .tolist() will remove any indexes you had, so your new Dataframe will be reindexed from 0 (It doesn't matter in your specific case). – Crashthatch Mar 27 '13 at 14:59
  • 16
    @Crashthatch -- then again you can just add `index = df.index` and you are good. – root Mar 27 '13 at 15:07
  • what if one cell can't be split? – Nisba Oct 30 '18 at 09:02
  • @Nisba: If any cell can't be split (e.g. string doesn't contain any space for this case) it will still work but one part of the split will be empty. Other situations will happen in case you have mixed types in the column with at least one cell containing any number type. Then the `split` method returns NaN and the `tolist` method will return this value as is (NaN) which will result in `ValueError` (to overcome this issue you can cast it to string type before splitting). I recommend you to try it on your own it's the best way of learning :-) – Nerxis Dec 18 '19 at 08:07
  • 1
    @techkuz: Are you sure your `df` has the `row` column header? You may think it's some kind of DataFrame attribute but it's quite clear this is the name of the column. It's up to you how you create and define your column headers so if you use different one use it (e.g. `df.my_column_name.split(...)`). – Nerxis Dec 18 '19 at 08:12
78

You can extract the different parts out quite neatly using a regex pattern:

In [11]: df.row.str.extract(r'(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
Out[11]: 
    fips                    1           state           county state_code
0  00000        UNITED STATES   UNITED STATES              NaN        NaN
1  01000              ALABAMA         ALABAMA              NaN        NaN
2  01001   Autauga County, AL             NaN   Autauga County         AL
3  01003   Baldwin County, AL             NaN   Baldwin County         AL
4  01005   Barbour County, AL             NaN   Barbour County         AL

[5 rows x 5 columns]

To explain the somewhat long regex:

(?P<fips>\d{5})
  • Matches the five digits (\d) and names them "fips".

The next part:

((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))

Does either (|) one of two things:

(?P<state>[A-Z ]*$)
  • Matches any number (*) of capital letters or spaces ([A-Z ]) and names this "state" before the end of the string ($),

or

(?P<county>.*?), (?P<state_code>[A-Z]{2}$))
  • matches anything else (.*) then
  • a comma and a space then
  • matches the two-letter state_code before the end of the string ($).

In the example:
Note that the first two rows hit the "state" (leaving NaN in the county and state_code columns), whilst the last three hit the county, state_code (leaving NaN in the state column).

Andy Hayden
  • 2
    This is definitely the best solution but it might be a bit overwhelming to some with the very extensive regex. Why not do that as a part 2 and have part 1 with just the fips and row columns? – Little Bobby Tables Apr 12 '16 at 11:48
  • 3
    @josh that's a good point, whilst the individual parts of the regex are "easy" to understand, long regex can get complicated quickly. I added some explanation for future readers! (I also had to update the link to the docs which explains the `(?P<label>)` named-group syntax.) – Andy Hayden Apr 13 '16 at 01:30
  • 2
    Looks much friendlier. I'm glad you did because it had me look at the docs to understand the `(?P<label>)` syntax. Now I know it, and it makes my code very succinct. – Little Bobby Tables Apr 13 '16 at 09:56
  • But does not this return only the first match for repeating patterns, like `123-456` if you search for digits with `(\d+)`? In other words, you have to know the exact pattern you are searching for, it can't dynamically grow to accommodate `123`, `123-456` and `123-456-789` – ruslaniv Mar 23 '22 at 10:50
55
df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)
4b0
Bhagabat Behera
30

You can use str.split with whitespace (the default separator) and the parameter expand=True to get a DataFrame, which you then assign to new columns:

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA', 
                           '01001 Autauga County, AL', '01003 Baldwin County, AL', 
                           '01005 Barbour County, AL']})
print (df)
                        row
0       00000 UNITED STATES
1             01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL



df[['a','b']] = df['row'].str.split(n=1, expand=True)
print (df)
                        row      a                   b
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL

If you also need to remove the original column, use DataFrame.pop:

df[['a','b']] = df.pop('row').str.split(n=1, expand=True)
print (df)
       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

Which is the same as:

df[['a','b']] = df['row'].str.split(n=1, expand=True)
df = df.drop('row', axis=1)
print (df)

       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you get this error:

# without n=1, the split happens on every whitespace
df[['a','b']] = df['row'].str.split(expand=True)

ValueError: Columns must be same length as key

you can check that the split returns a DataFrame with 4 columns, not only 2:

print (df['row'].str.split(expand=True))
       0        1        2     3
0  00000   UNITED   STATES  None
1  01000  ALABAMA     None  None
2  01001  Autauga  County,    AL
3  01003  Baldwin  County,    AL
4  01005  Barbour  County,    AL

Then the solution is to append the new DataFrame with join:

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA', 
                           '01001 Autauga County, AL', '01003 Baldwin County, AL', 
                           '01005 Barbour County, AL'],
                    'a':range(5)})
print (df)
   a                       row
0  0       00000 UNITED STATES
1  1             01000 ALABAMA
2  2  01001 Autauga County, AL
3  3  01003 Baldwin County, AL
4  4  01005 Barbour County, AL

df = df.join(df['row'].str.split(expand=True))
print (df)

   a                       row      0        1        2     3
0  0       00000 UNITED STATES  00000   UNITED   STATES  None
1  1             01000 ALABAMA  01000  ALABAMA     None  None
2  2  01001 Autauga County, AL  01001  Autauga  County,    AL
3  3  01003 Baldwin County, AL  01003  Baldwin  County,    AL
4  4  01005 Barbour County, AL  01005  Barbour  County,    AL

And to remove the original column (if there are other columns as well):

df = df.join(df.pop('row').str.split(expand=True))
print (df)
   a      0        1        2     3
0  0  00000   UNITED   STATES  None
1  1  01000  ALABAMA     None  None
2  2  01001  Autauga  County,    AL
3  3  01003  Baldwin  County,    AL
4  4  01005  Barbour  County,    AL   
jezrael
  • 1
    If there are two such columns, which need to be split, it returns: `ValueError: columns overlap but no suffix specified:` – learner Jul 20 '20 at 20:58
26

If you don't want to create a new dataframe, or if your dataframe has more columns than just the ones you want to split, you could:

df["flips"], df["row_name"] = zip(*df["row"].str.split().tolist())
del df["row"]  
keberwein
13

Use df.assign to create a new df. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html

split = df_selected['name'].str.split(',', n=1, expand=True)
df_split = df_selected.assign(first_name=split[0], last_name=split[1])
df_split.drop(columns=['name'], inplace=True)

Or in method chain form:

df_split = (df_selected
            .assign(list_col=lambda df: df['name'].str.split(',', n=1, expand=False),
                    first_name=lambda df: df.list_col.str[0],
                    last_name=lambda df: df.list_col.str[1])
            .drop(columns=['list_col']))
weaming
12

If you want to split a string into more than two columns based on a delimiter you can omit the 'maximum splits' parameter.
You can use:

df['column_name'].str.split('/', expand=True)

This will automatically create as many columns as the maximum number of fields included in any of your initial strings.
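
A small sketch of that behaviour (the column name and data here are made up for illustration): with no n limit, expand=True produces as many columns as the longest split and pads shorter rows with None.

import pandas as pd

df = pd.DataFrame({'column_name': ['a/b', 'a/b/c', 'a/b/c/d']})
print(df['column_name'].str.split('/', expand=True))
#    0  1     2     3
# 0  a  b  None  None
# 1  a  b     c  None
# 2  a  b     c     d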

Mewtwo
11

Surprised I haven't seen this one yet. If you only need to split into two columns, I highly recommend...

Series.str.partition

partition performs one split on the separator, and is generally quite performant.

df['row'].str.partition(' ')[[0, 2]]

       0                   2
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you need to rename the columns,

df['row'].str.partition(' ')[[0, 2]].rename({0: 'fips', 2: 'row'}, axis=1)

    fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you need to join this back to the original, use join or concat:

df.join(df['row'].str.partition(' ')[[0, 2]])

pd.concat([df, df['row'].str.partition(' ')[[0, 2]]], axis=1)

                        row      0                   2
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL
cs95
3

I saw that no one had used the slice method, so I'll put my 2 cents in here.

df["<col_name>"].str.slice(stop=5)
df["<col_name>"].str.slice(start=6)

This method will create two new columns.
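
For instance, applied to the question's data (a sketch: the rows are fixed width, so the code is characters 0-4 and the place name starts at character 6; the new column names are illustrative):

import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01001 Autauga County, AL']})
df['fips'] = df['row'].str.slice(stop=5)   # characters 0-4
df['name'] = df['row'].str.slice(start=6)  # everything from character 6 onward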

VingtCent
2

I prefer extracting the pandas Series I need (i.e. the relevant columns), using the apply function to split the column content into multiple Series, and then joining the generated columns to the existing DataFrame. Of course, the source column should be removed.

e.g.

 col1 = df["<col_name>"].apply(<function>)
 col2 = ...
 df = df.join(col1.to_frame(name="<name1>"))
 df = df.join(col2.to_frame(name="<name2>"))
 df = df.drop(["<col_name>"], axis=1)

To split two-word strings, the function should be something like this:

lambda x: x.split(" ")[0] # for the first element
lambda x: x.split(" ")[-1] # for the last element
mcchran