2

I have the following pandas DataFrame (only one column is shown):

0           Atlantic Division
1        Tampa Bay Lightning*
2              Boston Bruins*
3        Toronto Maple Leafs*
4            Florida Panthers
5           Detroit Red Wings
6          Montreal Canadiens
7             Ottawa Senators
8              Buffalo Sabres
9       Metropolitan Division
10       Washington Capitals*
11       Pittsburgh Penguins*
12       Philadelphia Flyers*
13     Columbus Blue Jackets*
14         New Jersey Devils*
15        Carolina Hurricanes
16         New York Islanders
17           New York Rangers
18           Central Division
19       Nashville Predators*
20             Winnipeg Jets*
21            Minnesota Wild*
22        Colorado Avalanche*
23            St. Louis Blues
24               Dallas Stars
25         Chicago Blackhawks
26           Pacific Division
27      Vegas Golden Knights*
28             Anaheim Ducks*
29           San Jose Sharks*
30         Los Angeles Kings*
31             Calgary Flames
32            Edmonton Oilers
33          Vancouver Canucks
34            Arizona Coyotes
35          Atlantic Division
36        Montreal Canadiens*
37           Ottawa Senators*
38             Boston Bruins*
39       Toronto Maple Leafs*
40        Tampa Bay Lightning
41           Florida Panthers
42          Detroit Red Wings
43             Buffalo Sabres
44      Metropolitan Division
45       Washington Capitals*
46       Pittsburgh Penguins*
47     Columbus Blue Jackets*
48          New York Rangers*
49         New York Islanders
50        Philadelphia Flyers
51        Carolina Hurricanes
52          New Jersey Devils
53           Central Division
54        Chicago Blackhawks*
55            Minnesota Wild*
56           St. Louis Blues*
57       Nashville Predators*
58              Winnipeg Jets
59               Dallas Stars
60         Colorado Avalanche
61           Pacific Division
62             Anaheim Ducks*
63           Edmonton Oilers*
64           San Jose Sharks*
65            Calgary Flames*
66          Los Angeles Kings
67            Arizona Coyotes
68          Vancouver Canucks
69          Atlantic Division
70          Florida Panthers*
71       Tampa Bay Lightning*
72         Detroit Red Wings*
73              Boston Bruins
74            Ottawa Senators
75         Montreal Canadiens
76             Buffalo Sabres
77        Toronto Maple Leafs
78      Metropolitan Division
79       Washington Capitals*
80       Pittsburgh Penguins*
81          New York Rangers*
82        New York Islanders*
83       Philadelphia Flyers*
84        Carolina Hurricanes
85          New Jersey Devils
86      Columbus Blue Jackets
87           Central Division
88              Dallas Stars*
89           St. Louis Blues*
90        Chicago Blackhawks*
91       Nashville Predators*
92            Minnesota Wild*
93         Colorado Avalanche
94              Winnipeg Jets
95           Pacific Division
96             Anaheim Ducks*
97         Los Angeles Kings*
98           San Jose Sharks*
99            Arizona Coyotes
100            Calgary Flames
101         Vancouver Canucks
102           Edmonton Oilers
103         Atlantic Division
104       Montreal Canadiens*
105      Tampa Bay Lightning*
106        Detroit Red Wings*
107          Ottawa Senators*
108             Boston Bruins
109          Florida Panthers
110       Toronto Maple Leafs
111            Buffalo Sabres
112     Metropolitan Division
113         New York Rangers*
114      Washington Capitals*
115       New York Islanders*
116      Pittsburgh Penguins*
117     Columbus Blue Jackets
118       Philadelphia Flyers
119         New Jersey Devils
120       Carolina Hurricanes
121          Central Division
122          St. Louis Blues*
123      Nashville Predators*
124       Chicago Blackhawks*
125           Minnesota Wild*
126            Winnipeg Jets*
127              Dallas Stars
128        Colorado Avalanche
129          Pacific Division
130            Anaheim Ducks*
131        Vancouver Canucks*
132           Calgary Flames*
133         Los Angeles Kings
134           San Jose Sharks
135           Edmonton Oilers
136           Arizona Coyotes
137         Atlantic Division
138            Boston Bruins*
139      Tampa Bay Lightning*
140       Montreal Canadiens*
141        Detroit Red Wings*
142           Ottawa Senators
143       Toronto Maple Leafs
144          Florida Panthers
145            Buffalo Sabres
146     Metropolitan Division
147      Pittsburgh Penguins*
148         New York Rangers*
149      Philadelphia Flyers*
150    Columbus Blue Jackets*
151       Washington Capitals
152         New Jersey Devils
153       Carolina Hurricanes
154        New York Islanders
155          Central Division
156       Colorado Avalanche*
157          St. Louis Blues*
158       Chicago Blackhawks*
159           Minnesota Wild*
160             Dallas Stars*
161       Nashville Predators
162             Winnipeg Jets
163          Pacific Division
164            Anaheim Ducks*
165          San Jose Sharks*
166        Los Angeles Kings*
167           Phoenix Coyotes
168         Vancouver Canucks
169            Calgary Flames
170           Edmonton Oilers
Name: team, dtype: object

I need to create one additional column with the city name.

At first glance the regex looks simple: the first word should be the city name, and the rest is the team name.

However, some cities have two words (Los Angeles, St. Louis, etc.).

Is it possible to do this with a regex, or does it have to be done manually?

Update: I tried the following:

nhl_df['city']=nhl_df['team'].str.extract(r'^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$')

But I get this error:

ValueError: Wrong number of items passed 2, placement implies 1
Luis Valencia
  • 1
    Probably for most of them you can remove the last word. But then there's `Central Division` etc., what would the city be for those ones? – perl Mar 09 '21 at 08:51

4 Answers

2

You can try something like this:

^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$

The city name ends up in either the first or the second capture group.

This pattern assumes that the first word of a two-word city name has at most five characters. The result may not be perfectly clean, but it seems to work on the given example.
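Because `str.extract` returns one column per capture group, this pattern yields a two-column DataFrame, which cannot be assigned to a single `city` column directly. A minimal sketch of combining the two groups, assuming a small sample Series named `teams` (a stand-in for `nhl_df['team']`):

```python
import pandas as pd

# Small sample standing in for nhl_df['team'] (illustrative only).
teams = pd.Series(
    ["Tampa Bay Lightning*", "Boston Bruins*", "St. Louis Blues", "Atlantic Division"],
    name="team",
)

pattern = r"^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$"

# Two capture groups -> DataFrame with columns 0 and 1.
groups = teams.str.extract(pattern)

# Take group 1 where it matched, otherwise fall back to group 2.
city = groups[0].fillna(groups[1])
print(city.tolist())
```

On the full frame the same idea would be `nhl_df['city'] = groups[0].fillna(groups[1])`, with `groups` extracted from `nhl_df['team']`.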

  • 1
    I tried this: nhl_df['city']=nhl_df['team'].str.extract(r'^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$') but I get this error: ValueError: Wrong number of items passed 2, placement implies 1 – Luis Valencia Mar 09 '21 at 11:04
  • I think this could help: https://stackoverflow.com/questions/43196907/valueerror-wrong-number-of-items-passed-meaning-and-suggestions – Vladimir Trifonov Mar 09 '21 at 11:32
1

^\S+(?=\s\S+$)

This regex gives you the first word of every team name that consists of exactly two words. The others you have to sort out manually, because the pattern alone cannot tell whether the middle word belongs to the city or to the team name.
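A rough sketch of that two-step approach in pandas, assuming a sample Series `teams` and a hand-written lookup `manual_cities` (both hypothetical, not from the question). A capture group is added to the pattern because `str.extract` requires one:

```python
import pandas as pd

# Illustrative sample; the real data would be nhl_df['team'].
teams = pd.Series(
    ["Boston Bruins*", "Tampa Bay Lightning*", "Toronto Maple Leafs", "St. Louis Blues"],
    name="team",
)

# Step 1: first word of names that have exactly two words; others become NaN.
city = teams.str.extract(r"^(\S+)(?=\s\S+$)", expand=False)

# Step 2: hand-maintained mapping for the names the regex cannot decide.
manual_cities = {
    "Tampa Bay Lightning": "Tampa Bay",
    "Toronto Maple Leafs": "Toronto",
    "St. Louis Blues": "St. Louis",
}
city = city.fillna(teams.str.rstrip("*").map(manual_cities))
print(city.tolist())
```

Names missing from the lookup stay `NaN`, which makes the remaining manual work easy to spot.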

alf_gralf
1

You can use

^([\w.]{1,5}(?:\s\w+)?\w*)

See the regex demo. Details:

  • ^ - start of string
  • ([\w.]{1,5}(?:\s\w+)?\w*) - Capturing group 1:
    • [\w.]{1,5} - one to five word or dot chars
    • (?:\s\w+)? - an optional occurrence of a whitespace and then one or more word chars
    • \w* - zero or more word chars.

Pandas test:

import pandas as pd
nhl_df = pd.DataFrame({"team":["Atlantic Division","Tampa Bay Lightning*","Boston Bruins*","Toronto Maple Leafs*","Florida Panthers","Detroit Red Wings","Montreal Canadiens","Ottawa Senators","Buffalo Sabres","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","Philadelphia Flyers*","Columbus Blue Jackets*","New Jersey Devils*","Carolina Hurricanes","New York Islanders","New York Rangers","Central Division","Nashville Predators*","Winnipeg Jets*","Minnesota Wild*","Colorado Avalanche*","St. Louis Blues","Dallas Stars","Chicago Blackhawks","Pacific Division","Vegas Golden Knights*","Anaheim Ducks*","San Jose Sharks*","Los Angeles Kings*","Calgary Flames","Edmonton Oilers","Vancouver Canucks","Arizona Coyotes","Atlantic Division","Montreal Canadiens*","Ottawa Senators*","Boston Bruins*","Toronto Maple Leafs*","Tampa Bay Lightning","Florida Panthers","Detroit Red Wings","Buffalo Sabres","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","Columbus Blue Jackets*","New York Rangers*","New York Islanders","Philadelphia Flyers","Carolina Hurricanes","New Jersey Devils","Central Division","Chicago Blackhawks*","Minnesota Wild*","St. Louis Blues*","Nashville Predators*","Winnipeg Jets","Dallas Stars","Colorado Avalanche","Pacific Division","Anaheim Ducks*","Edmonton Oilers*","San Jose Sharks*","Calgary Flames*","Los Angeles Kings","Arizona Coyotes","Vancouver Canucks","Atlantic Division","Florida Panthers*","Tampa Bay Lightning*","Detroit Red Wings*","Boston Bruins","Ottawa Senators","Montreal Canadiens","Buffalo Sabres","Toronto Maple Leafs","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","New York Rangers*","New York Islanders*","Philadelphia Flyers*","Carolina Hurricanes","New Jersey Devils","Columbus Blue Jackets","Central Division","Dallas Stars*","St. Louis Blues*","Chicago Blackhawks*","Nashville Predators*","Minnesota Wild*","Colorado Avalanche","Winnipeg Jets","Pacific Division","Anaheim Ducks*","Los Angeles Kings*","San Jose Sharks*","Arizona Coyotes","Calgary Flames","Vancouver Canucks","Edmonton Oilers","Atlantic Division","Montreal Canadiens*","Tampa Bay Lightning*","Detroit Red Wings*","Ottawa Senators*","Boston Bruins","Florida Panthers","Toronto Maple Leafs","Buffalo Sabres","Metropolitan Division","New York Rangers*","Washington Capitals*","New York Islanders*","Pittsburgh Penguins*","Columbus Blue Jackets","Philadelphia Flyers","New Jersey Devils","Carolina Hurricanes","Central Division","St. Louis Blues*","Nashville Predators*","Chicago Blackhawks*","Minnesota Wild*","Winnipeg Jets*","Dallas Stars","Colorado Avalanche","Pacific Division","Anaheim Ducks*","Vancouver Canucks*","Calgary Flames*","Los Angeles Kings","San Jose Sharks","Edmonton Oilers","Arizona Coyotes","Atlantic Division","Boston Bruins*","Tampa Bay Lightning*","Montreal Canadiens*","Detroit Red Wings*","Ottawa Senators","Toronto Maple Leafs","Florida Panthers","Buffalo Sabres","Metropolitan Division","Pittsburgh Penguins*","New York Rangers*","Philadelphia Flyers*","Columbus Blue Jackets*","Washington Capitals","New Jersey Devils","Carolina Hurricanes","New York Islanders","Central Division","Colorado Avalanche*","St. Louis Blues*","Chicago Blackhawks*","Minnesota Wild*","Dallas Stars*","Nashville Predators","Winnipeg Jets","Pacific Division","Anaheim Ducks*","San Jose Sharks*","Los Angeles Kings*","Phoenix Coyotes","Vancouver Canucks","Calgary Flames","Edmonton Oilers"]})
nhl_df['city']=nhl_df['team'].str.extract(r'^([\w.]{1,5}(?:\s\w+)?\w*)')
>>> nhl_df
                     team         city
0       Atlantic Division     Atlantic
1    Tampa Bay Lightning*    Tampa Bay
2          Boston Bruins*       Boston
3    Toronto Maple Leafs*      Toronto
4        Florida Panthers      Florida
..                    ...          ...
166    Los Angeles Kings*  Los Angeles
167       Phoenix Coyotes      Phoenix
168     Vancouver Canucks    Vancouver
169        Calgary Flames      Calgary
170       Edmonton Oilers     Edmonton
Wiktor Stribiżew
  • It works perfectly except for one match, which I guess I can fix manually: `Vegas Golden Knights*` gives `Vegas Golden` – Luis Valencia Mar 09 '21 at 11:40
  • @LuisValencia I was just following the logic in the regex pattern you posted in the question. If you want to implement any change, please provide a requirement for this. Try with an exception for `Vegas` then, `^(Vegas|[\w.]{1,5}(?: \w+)?\w*)` – Wiktor Stribiżew Mar 09 '21 at 11:44
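A quick sketch of the `Vegas` exception from the comment above, checked on a small sample Series named `teams` (an illustrative stand-in for `nhl_df['team']`). The explicit alternative is listed first so that it wins over the generic branch:

```python
import pandas as pd

# Illustrative sample; includes the one name the generic pattern gets wrong.
teams = pd.Series(
    ["Vegas Golden Knights*", "Los Angeles Kings*", "Boston Bruins*", "St. Louis Blues"],
    name="team",
)

# "Vegas" is tried before the generic "short first word + optional second word" branch.
pattern = r"^(Vegas|[\w.]{1,5}(?: \w+)?\w*)"

# One capture group with expand=False returns a Series, so it fits a single column.
city = teams.str.extract(pattern, expand=False)
print(city.tolist())
```

Any other city that needs special handling could be added as a further alternative in the same way.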
0

Try the regex below:

/(\d*\s*)([a-zA-Z\s]*)(\s)(\b([a-zA-Z*]*))$/

Check this:

function Replace(str) {
  // $2: letters and spaces before the last word; $4: the last word, including a trailing *
  return str.replace(/(\d*\s*)([a-zA-Z\s]*)(\s)(\b([a-zA-Z*]*))$/gim, function (match, $1, $2, $3, $4) {
    return `${$2}--${$4}`;
  });
}
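Since the question is about pandas, here is a rough Python counterpart of the same "split off the last word" idea. It is only a sketch: the Series `teams` and the group names `prefix` and `last_word` are illustrative choices, not part of the answer above.

```python
import pandas as pd

# Illustrative sample; the real data would be nhl_df['team'].
teams = pd.Series(
    ["Boston Bruins*", "Los Angeles Kings", "Toronto Maple Leafs*"],
    name="team",
)

# Named groups become column names: everything before the last word,
# and the last word without the optional trailing asterisk.
parts = teams.str.extract(r"^(?P<prefix>.*)\s(?P<last_word>\S+?)\*?$")
print(parts)
```

Note that for team names with more than one word (such as Maple Leafs) the prefix still contains part of the team name, so this alone does not isolate the city.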
Jobelle