Extract certain integers from string value, of different length, which contains unwanted integers. Pattern or Position

Question

I am somewhat of a beginner programmer and am looking for help and an explanation of a problem. I am looking to extract the ID numbers from a string into new column, then fill in missing numbers.

I am working with pandas dataframe and I have the following set of street names, some with an ID number and others missing:

*Start station*:
"19th & L St (31224)"
"14th & R St NW (31202)"
"Paul Rd & Pl NW (31602)"
"14th & R St NW"
"19th & L St"
"Paul Rd & Pl NW"

My desired outcome:
*Start station*         *StartStatNum*
"14th & R St NW"        31202
"19th & L St"           31224
"Paul Rd & Pl NW"       31602
"14th & R St NW"        31202
"19th & L St"           31224
"Paul Rd & Pl NW"       31602

I am having difficulty after my first step of splitting. I can split based on position with the following:

def Stat_Num(Stat_Num):
    return Stat_Num.split('(')[-1].split(')')[0].strip()

db["StartStatNum"] = pd.DataFrame({'Num':db['Start station'].apply(Stat_Num)})

But this gives:
*Start station*         *StartStatNum*
"19th & L St (31224)"        31202
"14th & R St NW (31202)"     31224
"Paul Rd & Pl NW (31602)"    31602
"14th & R St NW"            "14th & R St NW"
"19th & L St"               "19th & L St"
"Paul Rd & Pl NW"           "Paul Rd & Pl NW"

The problem would then arise when I want to find/fill StartStatNum with the station ID numbers that I don't have.

I have been trying to get to know str.extract, str.contains, re.findall and tried the following as a possible stepping stone:

db['Start_S2']  = db['Start_Stat_Num'].str.extract(" ((\d+))")
db['Start_S2']  = db['Start station'].str.contains(" ((\d+))")
db['Start_S2']  = db['Start station'].re.findall(" ((\d+))")

I have also tried the following this from here

def parseIntegers(mixedList):
return [x for x in db['Start station'] if (isinstance(x, int) or isinstance(x, long)) and not isinstance(x, bool)]

However when I pass values in, I get a list 'x' with 1 value. As a bit of a noob, I don't think going the pattern route is best as it will also take in unwanted integers (although I could possibly turn to Nan's as they would be less than 30000 (the lowest value for ID number) I also have an idea that it could be something simple that I'm overlooking, but after about 20 straight hours and a lot of searching, I am at a bit of a loss.

Any help would be extremely helpful.

score 1 · Answer 1 · answered May 14 '15 at 17:57

A solution could be to create a dataframe with the transformation

station -> id

like

l = ["19th & L St (31224)",
    "14th & R St NW (31202)",
    "Paul Rd & Pl NW (31602)",
    "14th & R St NW",
    "19th & L St",
    "Paul Rd & Pl NW",]

df = pd.DataFrame( {"station":l})
df_dict = df['station'].str.extract("(?P<station_name>.*)\((?P<id>\d+)\)").dropna()
print df_dict

 # result:
       station_name     id
 0      19th & L St   31224
 1   14th & R St NW   31202
 2  Paul Rd & Pl NW   31602
 [3 rows x 2 columns]

Starting from there, you can use some list comprehension:

l2 = [ [row["station_name"], row["id"]]
       for line in l
       for k,row in df_dict.iterrows()
       if row["station_name"].strip() in line]

to get:

 [['19th & L St ', '31224'], 
  ['14th & R St NW ', '31202'], 
  ['Paul Rd & Pl NW ', '31602'], 
  ['14th & R St NW ', '31202'], 
  ['19th & L St ', '31224'], 
  ['Paul Rd & Pl NW ', '31602']]

I let you transform the later in dataframe...

There might be nicer solutions for the last part at least...

score 1 · Accepted Answer · answered May 14 '15 at 18:17

1

Here's a way that worked for me, firstly extract the numbers in the braces:

In [71]:

df['start stat num'] = df['Start station'].str.findall(r'\((\d+)\)').str[0]
df
Out[71]:
             Start station start stat num
0      19th & L St (31224)          31224
1   14th & R St NW (31202)          31202
2  Paul Rd & Pl NW (31602)          31602
3           14th & R St NW            NaN
4              19th & L St            NaN
5          Paul Rd & Pl NW            NaN

Now remove the number as we don't need it anymore:

In [72]:

df['Start station'] = df['Start station'].str.split(' \(').str[0]
df
Out[72]:
     Start station start stat num
0      19th & L St          31224
1   14th & R St NW          31202
2  Paul Rd & Pl NW          31602
3   14th & R St NW            NaN
4      19th & L St            NaN
5  Paul Rd & Pl NW            NaN

Now we can fill in the missing station number by calling map on the df with the NaN rows removed, and the station name set as the index, this will lookup the station name and return the station number:

In [73]:

df['start stat num'] = df['Start station'].map(df.dropna().set_index('Start station')['start stat num'])
df
Out[73]:
     Start station start stat num
0      19th & L St          31224
1   14th & R St NW          31202
2  Paul Rd & Pl NW          31602
3   14th & R St NW          31202
4      19th & L St          31224
5  Paul Rd & Pl NW          31602

answered May 14 '15 at 18:17

EdChum

376,765
198
813
562

Thank you @EdChum, The first 2 I have got to work. Filling in the NaN's is giving me an error : `InvalidIndexError: Reindexing only valid with uniquely valued Index objects`. Is this as taking the string as the index it may be slightly different to another with same ID number? Also, if I could check, in the 2nd line `str.split(' \(').str[0]`. I can understand the split on the ')' but what part does the '\' play?. Thank you. – Ken Lawlor May 14 '15 at 18:47
I don't understand your first comment, with regards to the split the `\` is needed to escape the braces so what I'm doing is splitting on ' (' so single space open brace – EdChum May 14 '15 at 18:50
When I run the 3 section (to fill NaN's) on my dataset I get the `InvalidIndexError: Reindexing only valid with uniquely valued Index objects`. So my thinking is that there may be a street of eg. `Paul Rd & Pl NW` : `31202`, but also another row with same street bit slightly different: eg. `paul Rd & Pl Nw` : `31202`. Would I need to strip out white space and put to lower case the Start station. hope that makes a little better sence. @EdChum – Ken Lawlor May 14 '15 at 19:06
Yes you'd have to clean and extract the data before calling `map` – EdChum May 14 '15 at 19:08
Sorry: In other words, setting 'Start station' as index, there may be same values but spelt slightly different (but having same ID number). Therefore not making them unique – Ken Lawlor May 14 '15 at 19:09
You'd have to merge in that case something like `df.merge(df.dropna(), on='Start station', how='left')` I think should work – EdChum May 14 '15 at 19:11
I have marked as answered, but just for clarity for anyone stumbling upon this question. I had to do something slightly different due to the data. the street names in my data were entered inconsistently. i.e. 'Paul Rd & Pl NW' and 'PI & Paul Rd NW'. This caused problems with merge and grouping. To overcome this, I was forced to drop some instances and merge with a separate file containing proper station names (merged on station name after cleaning, white space strip, and putting to lower case.) Thanks to all for the help. – Ken Lawlor May 27 '15 at 12:39

Extract certain integers from string value, of different length, which contains unwanted integers. Pattern or Position

2 Answers2