1

I have a dataframe. One of the columns is a combination of CITY and STATE. I want to split this column to two columns, CITY and STATE using:

df['CITY'],df['STATE'] = df['WORKSITE'].str.split(",")

And I got this error:

ValueError Traceback (most recent call last) in () ----> 1 df['CITY'],df['STATE'] = df['WORKSITE'].str.split(",")

ValueError: too many values to unpack (expected 2)

So, I'm wondering is there a method that I can ignore the exceptions or detect which row is not working?

fmw42
  • 46,825
  • 10
  • 62
  • 80
Kyle Chen
  • 19
  • 4
  • I'm not strongly familiar with Python, but I expect (based on my Perl sensibilities) `split` isn't really going to be the solution you need, because it will generate an array of varying length on the RHS, while your LHS is always an array of length 2. – Brian A. Henning Apr 21 '17 at 17:38
  • You may be better-served to find the position of the first comma, then use substring methods to get the prefix and postfix. – Brian A. Henning Apr 21 '17 at 17:39
  • [`.str.split()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html) takes a parameter `n` which limits the number of splits, so `df['WORKSITE'].str.split(",", n=1)` should do what you want. – Craig Apr 21 '17 at 17:45
  • @Craig I've tried that and still not working. – Kyle Chen Apr 21 '17 at 17:49

1 Answers1

3

Set n=2 in the split call and use the str method effectively:

import pandas as pd

x = ['New York, NY', 'Los Angeles, CA', 'Kansas City, KS, US']

df = pd.DataFrame(x, columns=['WORKSITE'])

df['CITY'], df['STATE'] = df['WORKSITE'].str.split(',', 2).str[0:2].str

print df

Output

              WORKSITE         CITY STATE
0         New York, NY     New York    NY
1      Los Angeles, CA  Los Angeles    CA
2  Kansas City, KS, US  Kansas City    KS

I got some help from looking at this answer to this question.

Community
  • 1
  • 1
  • This is great. I really appreciate your help! – Kyle Chen Apr 22 '17 at 02:55
  • Happy that my answer inspired this one. I'd just add that, in this case, the `n=2` to the split is not strictly necessary (although it's good for performance in case there are pathological cases with lots of excessive commas). On the other hand, neither `n=2` nor the slice notation protects against a `TypeError` if there are **less** commas than necessary. In that case the developer would have to first split, then use the `.str.len()` method to create a column which would be used to filter rows with too few splits. – LeoRochael Jan 11 '18 at 20:41