I am struggling to work out why my code, which uses split, is not working. I have a column like this:

RegionName
Alabama[edit]
Auburn (Auburn University)
Florence(University of North Alabama)
Jacksonville
.
.
.
and so on..

The entries above show the kinds of values in the column. For entries that are state names, such as Alabama[edit], I want NaN. The remaining entries are regions within that state, and I want to clean them where needed; if no cleaning is required, the entry should stay as it is. I am using the code below:

for x in Town['RegionName']:
    if re.match(r"\s*\(",x):
        x.split('(').strip()
    elif re.match(r"\d+\[",x):
        x = np.NaN
    else:
        x

The code runs without any error but all the entries stay intact. The desired output is -

RegionName
NaN
Auburn
Florence
Jacksonville
.
.
.
The cleaning required is: remove everything from the opening parenthesis onward. There may be a space between the name and the parenthesis, so that has to be removed as well.

Please advise.

Nakul Sharma
  • The `x` in the statement `for x in Town['RegionName']` is only in the scope of the `for` loop; you are just changing a copy not the actual element. – pstatix Apr 13 '18 at 17:41
  • Where is this file from? I can't believe how many times this question has been asked! – pault Apr 13 '18 at 19:00
  • I tried searching for this, as I also believed someone might have asked it before, but I couldn't find one, so I had to ask here myself. I apologize for any inconvenience caused. – Nakul Sharma Apr 13 '18 at 20:52

3 Answers

You need to assign the value back to the column; reassigning x inside the loop only rebinds the loop variable and leaves the column unchanged.

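for idx, x in Town['RegionName'].items():
    # Manipulation of x
    ...
    # Write back by index label; chained indexing (Town['RegionName'][i] = x) may not update the frame
    Town.loc[idx, 'RegionName'] = x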
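
A fuller sketch of the same idea, for illustration only. The sample data and the clean_region helper are assumptions made up for this example, not part of the original question; the helper follows the cleaning rules the question describes and writes each value back by index label with .loc.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question's column
Town = pd.DataFrame({"RegionName": [
    "Alabama[edit]",
    "Auburn (Auburn University)",
    "Florence(University of North Alabama)",
    "Jacksonville",
]})

def clean_region(x):
    """Hypothetical helper: NaN for state rows, else the text before any '('."""
    if "[edit]" in x:
        return np.nan
    return x.split("(")[0].strip()

# Write each cleaned value back to the frame by its index label
for idx, x in Town["RegionName"].items():
    Town.loc[idx, "RegionName"] = clean_region(x)

print(Town["RegionName"])
```

Note that rows without a parenthesis pass through unchanged, because split("(")[0] returns the whole string when there is nothing to split on.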
Brendan Abel

Using .apply with lambda and str.split

Demo:

import pandas as pd
import numpy as np

df = pd.DataFrame({"a":["Alabama[edit]", "Auburn (Auburn University)", "Jacksonville"]})
print(df["a"].apply(lambda x: np.nan if "[edit]" in x else x.split("(")[0].strip()))

Output:

0             NaN
1          Auburn
2    Jacksonville
Name: a, dtype: object
Rakesh

Iterating over rows in pandas is discouraged when avoidable because it's slow. Here's a faster, vectorized approach to your problem, using np.where:

# Raw strings keep the regex backslashes from being read as string escapes
Towns["RegionName"] = np.where(
    Towns["RegionName"].str.contains(r"\[edit\]"),
    np.nan,
    Towns["RegionName"].str.split(r"(\s)?\(", expand=True)[0]
)
print(Towns["RegionName"])
#0             NaN
#1          Auburn
#2        Florence
#3    Jacksonville
#Name: RegionName, dtype: object

The first argument to np.where is a condition, which is evaluated element-wise. Where the condition is True, the value from the second argument is used; where it is False, the value from the last argument is used. For replacing everything including and after the (, I used the answer I posted on this similar question.
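
As a minimal sketch of how np.where picks values element-wise, here is a toy example. The Series s and the "STATE" / "REGION" labels are made up for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, similar in shape to the question's column
s = pd.Series(["Alabama[edit]", "Auburn (Auburn University)", "Jacksonville"])

# Boolean mask: True where the entry is a state row (literal match, no regex)
is_state = s.str.contains("[edit]", regex=False)

# For each position: take "STATE" where the mask is True, "REGION" otherwise
labels = np.where(is_state, "STATE", "REGION")
print(labels)
```

In the answer's code the second and third arguments are np.nan and the split column, so the same mask decides row by row which of the two ends up in the result.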

pault
  • Hi. I have one question. In the np.where call, [0] has been appended at the end of the split expression. I verified that without it, the shape of the result changes to (567,7) and it throws an error, whereas the right shape is (567,), which is what the provided code gives. I want to understand how [0] causes this change, if you can please explain the concept behind it. Thanks again for all the help. – Nakul Sharma Apr 14 '18 at 00:08
  • `split(..., expand=True)` will return multiple columns. `[0]` gets the first one, which is everything before the pattern on which we split. Look at the output of `Towns["RegionName"].str.split("(\s)?\(", expand=True)` by itself and this will be clearer. – pault Apr 14 '18 at 11:38
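
To see the shape change described in the comments above, here is a small sketch. The Series s is hypothetical sample data, and it splits on a plain "(" rather than the regex from the answer to keep the example simple.

```python
import pandas as pd

# Hypothetical sample values, not the real 567-row column
s = pd.Series([
    "Auburn (Auburn University)",
    "Florence(University of North Alabama)",
    "Jacksonville",
])

# expand=True returns a DataFrame with one column per split piece
parts = s.str.split("(", expand=True)
print(parts.shape)  # (3, 2)

# [0] selects the first column, a Series again: the text before the "("
first = parts[0].str.strip()
print(first.shape)  # (3,)
print(first.tolist())
```

The DataFrame has as many columns as the longest split needs, which is presumably why the real data gave 7 columns; np.where then cannot line that 2-D result up with the 1-D condition, while the single column selected by [0] matches it.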