0

I have a dataset of crimes reported by Gloucestershire Constabulary from 2011-16. It's a .csv file that I have imported to a Pandas dataframe. The data include a column stating the Lower Super Output Area (LSOA) in which the crime occurred, so for crimes in Tewkesbury, for instance, each record has the corresponding LSOA name, e.g. 'Tewkesbury 009D'; 'Tewkesbury 009E'.

I want to group these data by the town/city they relate to, e.g. 'Gloucester', 'Tewkesbury', ignoring the specific LSOAs within each conurbation. Ideally, I would append a new column to the dataframe, with just the place name copied across, and group on that. I am comfortable with how to do the grouping, just not the new column in the first place. Any advice on how to do this is gratefully received.

the_bonze
  • 325
  • 1
  • 4
  • 11

3 Answers3

1

Something along these lines should work:

df['town'] = [x.split()[0] for x in df['LSOA']]
Ezer K
  • 3,637
  • 3
  • 18
  • 34
  • 1
    You don't need the list comprehension: `df['town'] = df.LSOA.str.split().str[0]` – andrew_reece Apr 22 '17 at 20:26
  • @andrew_reece - this worked, thank you. I get a "SetWithCopyWarning": "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead" but the data look OK regardless? – the_bonze Apr 23 '17 at 09:23
  • This usually means `df` is a view or chained index from another data frame you made earlier. There are [a few ways](http://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas) to resolve this issue, using `.loc` is one of them., using `copy()` is another. Often, you can get away with ignoring this warning, although there are [instances](http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy) where this will get you in trouble. – andrew_reece Apr 23 '17 at 15:09
0

I am no Pandas expert but I think you can do string slicing to strip out the last five digits (it supports regex too if I recall correctly, so you can do a proper 'search' if required).

#x is the original dataframe
new_col = x.lsoa.str[:-5]    #lsoa is the column containing city names
pd.concat([x, new_col], axis=1)

The str method can be used to extract a string out of the lsoa column of the dataframe.

Abid Hasan
  • 648
  • 4
  • 10
0

You can use regex to extract the city name from the DataFrame and then join the result to the original DataFrame. If your inital DataFrame is df

df = pd.DataFrame([ 'Tewkesbury 009D', 'Tewkesbury 009E'], columns=['LSOA'])
In [2]: df
Out[2]: 
              LSOA
0  Tewkesbury 009D
1  Tewkesbury 009E

Then you can extract the city name and optionally the LSOA code in to a new DataFrame df_new

df_new = df['LSOA'].str.extract('(\w*)\s(\d+\w*)', expand=True)

In [10]: df_new
Out[10]: 
            0     1
0  Tewkesbury  009D
1  Tewkesbury  009E

If you want to discard the code and just keep the city name remove the second bracket from the regex as '(\w*)\s\d+\w*' . Now you can append this result to the original DataFrame

In [11]: df.join(df_new)
Out[11]: 
              LSOA           0     1
0  Tewkesbury 009D  Tewkesbury  009D
1  Tewkesbury 009E  Tewkesbury  009E
hashmuke
  • 3,075
  • 2
  • 18
  • 29