
I'm working with the following data in Pandas. For the block column, I need to change each value so it only includes the street name (this way I can geocode for the lat long coordinates). To use the geocoder I'm working with, I also need to include "Washington, DC".

crimes = pd.read_csv("/content/SearchResults (2).txt", encoding='latin-1')

(screenshot: output of crimes.head())

This is what I want the BLOCK column to look like:

2ND STREET SE, WASHINGTON DC

TAYLOR STREET NE, WASHINGTON DC

How do I do this? If it's easier, I can add another column with this info instead of changing the BLOCK column. Apparently you can't use string methods on a pandas DataFrame, and I'm clueless when it comes to regular expressions... please help!

Edit:

this code does exactly what I want:

for i in crimes['BLOCK']:
  i = i.split()
  i = i[-3:]
  i = " ".join([str(elem) for elem in i])
  i = i + ", WASHINGTON DC "
  print(i)

the output looks like this:

MINNESOTA AVENUE NE, WASHINGTON DC 
MORSE STREET NE, WASHINGTON DC 

How do I reassign the actual column values to the i variable above?

Edit 2:

Here is an example of the csv file:

REPORT_DAT,OFFENSE,METHOD,BLOCK,DISTRICT,WARD,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,XBLOCK,YBLOCK,START_DATE
6/30/2020 3:03:21 AM,THEFT F/AUTO,OTHERS,5700  - 5799 BLOCK OF 27TH STREET NW,2,4,Cluster 10,001500 1,395132,144513,6/29/2020 2:00:48 PM
6/30/2020 12:04:33 AM,MOTOR VEHICLE THEFT,OTHERS,4432 - 4499 BLOCK OF GREENWICH PARKWAY NW,2,3,Cluster 13,000802 2,392727,138206,6/29/2020 1:00:43 PM 
furas
  • 2
    Just enclose data and expected output in code please. Pictures are hard to work with :-) – ipj Jul 30 '20 at 20:11
  • Data is just a massive CSV file. The screenshot is output from crimes.head(). To index that column, I can do crimes['BLOCK']. Is there a way to upload the data without just pasting a giant csv file? Thanks for your help :) – SoftEngStudent Jul 30 '20 at 20:16
  • 2
    You can just post a few rows and a few columns -- no need to post an entire CSV file. More info here: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – jsmart Jul 30 '20 at 20:19
  • Here's 2 lines of the csv file: REPORT_DAT,SHIFT,OFFENSE,METHOD,BLOCK 6/30/2020 3:03:21 AM,MIDNIGHT,THEFT F/AUTO,OTHERS,5700 - 5799 BLOCK OF 27TH STREET NW 6/30/2020 12:04:33 AM,MIDNIGHT,MOTOR VEHICLE THEFT,OTHERS,4432 - 4499 BLOCK OF GREENWICH PARKWAY NW – SoftEngStudent Jul 30 '20 at 20:24
  • the headers end after BLOCK – SoftEngStudent Jul 30 '20 at 20:26
  • It's better to post some example data as text, which we can use to build and test a solution. And put it in the question, not in a comment. It will be more readable, and more people will see it, so more people may try to help you. – furas Jul 30 '20 at 20:49

1 Answer

2

I don't know what you tried, but I have no problem using the string methods built into pandas:

df['BLOCK'] = df['BLOCK'].str.split('OF').str[1] + ', WASHINGTON DC'

Minimal working code

text ='''REPORT_DAT,SHIFT,OFFENSE,METHOD,BLOCK
6/30/2020 3:03:21 AM,MIDNIGHT,THEFT F/AUTO,OTHERS,5700 - 5799 BLOCK OF 27TH STREET NW
6/30/2020 12:04:33 AM,MIDNIGHT,MOTOR VEHICLE THEFT,OTHERS,4432 - 4499 BLOCK OF GREENWICH PARKWAY NW'''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(text))

print('--- before ---')
print(df['BLOCK'])

df['BLOCK'] = df['BLOCK'].str.split('OF').str[1] + ', WASHINGTON DC'

print('--- after ---')
print(df['BLOCK'])

Result

--- before ---
0          5700 - 5799 BLOCK OF 27TH STREET NW
1    4432 - 4499 BLOCK OF GREENWICH PARKWAY NW
Name: BLOCK, dtype: object

--- after ---
0           27TH STREET NW, WASHINGTON DC
1     GREENWICH PARKWAY NW, WASHINGTON DC
Name: BLOCK, dtype: object
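
One caveat with splitting on 'OF': the piece after it keeps a leading space, and a street name that itself contains the letters "OF" would be cut in the wrong place. The sketch below is one possible variation, not the only way to do it. It assumes every row contains the full " BLOCK OF " marker, and the second sample row and the ADDRESS column name are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'BLOCK': [
        '5700  - 5799 BLOCK OF 27TH STREET NW',
        '100 - 199 BLOCK OF HOFFMAN STREET SE',  # made-up row: name contains "OF"
    ]
})

# Split once on the full marker, keep the part after it, and trim whitespace.
street = df['BLOCK'].str.split(' BLOCK OF ', n=1).str[1].str.strip()

# Write to a new column so the original BLOCK values stay available.
df['ADDRESS'] = street + ', WASHINGTON DC'

print(df['ADDRESS'].tolist())
```

Writing to a separate column is optional; assigning back to df['BLOCK'] works the same way.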

BTW: pandas has its own string methods, some of which plain Python strings don't have, e.g. .str.contains(). Others behave differently from their built-in counterparts, e.g. .str.replace() can take a regex (in recent pandas versions you have to pass regex=True for that).
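
A small sketch of those two methods on the sample BLOCK values. The regex pattern is my own assumption about the data format (everything up to "BLOCK OF" is the part to drop), so adjust it to the real file.

```python
import pandas as pd

blocks = pd.Series([
    '5700  - 5799 BLOCK OF 27TH STREET NW',
    '4432 - 4499 BLOCK OF GREENWICH PARKWAY NW',
])

# .str.contains() returns a boolean Series, handy for filtering rows.
has_street = blocks.str.contains('STREET')

# .str.replace() with regex=True: drop everything up to and including
# "BLOCK OF" plus the whitespace after it.
streets = blocks.str.replace(r'^.*BLOCK OF\s+', '', regex=True)

print(has_street.tolist())
print(streets.tolist())
```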


BTW: You can also use .apply(), and then you can use the standard Python string functions:

df['BLOCK'] = df['BLOCK'].apply(lambda text: text.split('OF')[1] + ', WASHINGTON DC')

or

def convert(text):
    return text.split('OF')[1] + ', WASHINGTON DC'
    
df['BLOCK'] = df['BLOCK'].apply(convert)

and then you can put more complex code inside convert(), e.g. you can easily use if/else.
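
For instance, a version of convert() with if/else might look like the sketch below. The " BLOCK OF " marker, the fallback for rows without it, and the second and third sample rows are assumptions for illustration, not something taken from the real data.

```python
import pandas as pd

MARKER = ' BLOCK OF '  # assumed separator, based on the sample rows

def convert(text):
    # Leave missing values alone instead of failing on .split()
    if not isinstance(text, str):
        return text
    if MARKER in text:
        street = text.split(MARKER, 1)[1].strip()
    else:
        # Hypothetical fallback: no marker found, keep the whole value
        street = text.strip()
    return street + ', WASHINGTON DC'

df = pd.DataFrame({'BLOCK': [
    '5700  - 5799 BLOCK OF 27TH STREET NW',
    '14TH STREET NW AND U STREET NW',  # made-up row without the marker
    None,                              # made-up missing value
]})

df['ADDRESS'] = df['BLOCK'].apply(convert)
print(df['ADDRESS'].tolist())
```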

furas
  • This did the trick so easily - thank you SO SO MUCH!!!!! I must have made a careless error because you're right string methods work just fine – SoftEngStudent Jul 30 '20 at 21:03
  • To work with strings you need to use `.str`, and if you chain two string functions you need to use `.str` twice. BTW: pandas also has string functions that plain Python strings don't have, e.g. `.str.contains()`. Others behave differently, e.g. `.str.replace()` can use a regex. – furas Jul 30 '20 at 21:10
  • I added example with `.apply()` – furas Jul 30 '20 at 21:17
  • Thank you! I ended up using your apply() method to create a column of my geolocations. `homicides = crimes[crimes.OFFENSE == 'HOMICIDE'] def convert(street): return geolocator.geocode(street) homicides['LOC'] = homicides['STREET'].apply(convert) homicides.head()` It works perfectly! Even though the output is as expected, I get this warning: `SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead` You've already done more than enough but since you seem to know the functio – SoftEngStudent Jul 30 '20 at 22:20
  • do you have any idea if I need to worry about it or not? It's not an error, just a warning. It may have something to do with my geocoder - I'll post if I find a solution :) Thanks again! – SoftEngStudent Jul 30 '20 at 22:22
  • what if you use directly `crimes[crimes.OFFENSE == 'HOMICIDE']['LOC'] = crimes[crimes.OFFENSE == 'HOMICIDE']['STREET'].apply(convert) ` ? – furas Jul 30 '20 at 22:29
  • This works as well but still throws the same warning. I believe this is because I removed columns from the original dataset before doing that so maybe it's reminding me that the original set won't be altered. This is necessary for me though since I'm using colab which can get so slow with too much data. Thanks again for your help!! – SoftEngStudent Jul 30 '20 at 22:37
  • you can also try the `.loc[]` form from the warning - probably `crimes.loc[crimes.OFFENSE == 'HOMICIDE' , 'LOC'] = crimes[crimes.OFFENSE == 'HOMICIDE']['STREET'].apply(convert)`. The previous version may assign to a different instance (a copy), so you may end up with `crimes` unchanged. – furas Jul 30 '20 at 23:11
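
On the SettingWithCopyWarning from the comments: two commonly used ways to avoid it are sketched below, as an illustration rather than a definitive fix. The small crimes frame is sample data, and convert() here is a hypothetical stand-in that only formats a string, so the example runs without a geocoder or network access.

```python
import pandas as pd

crimes = pd.DataFrame({
    'OFFENSE': ['HOMICIDE', 'THEFT F/AUTO', 'HOMICIDE'],
    'STREET': [
        '27TH STREET NW, WASHINGTON DC',
        'GREENWICH PARKWAY NW, WASHINGTON DC',
        'MORSE STREET NE, WASHINGTON DC',
    ],
})

def convert(street):
    # Hypothetical stand-in for geolocator.geocode(street); no network call
    return 'LOC(' + street + ')'

# Option 1: take an explicit copy of the filtered rows, then add the column
# to that copy. The original crimes frame is left untouched.
homicides = crimes[crimes.OFFENSE == 'HOMICIDE'].copy()
homicides['LOC'] = homicides['STREET'].apply(convert)

# Option 2: write into the original frame with a single .loc call.
# Rows outside the mask get a missing value in the new LOC column.
mask = crimes.OFFENSE == 'HOMICIDE'
crimes.loc[mask, 'LOC'] = crimes.loc[mask, 'STREET'].apply(convert)

print(homicides)
print(crimes)
```

Option 1 fits the workflow described in the comments, where only the smaller filtered frame is needed afterwards.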