0

Going on two months in python and I am focusing hard on Pandas right now. In my current position I use VBA on data frames, so learning this to slowly replace it and further my career. As of now I believe my true problem is the lack of understanding a key concept(s). Any help would be greatly appreciated.

That said here is my problem:

Where could I go to learn more on how to do stuff like this for more precise filtering. I'm very close but there is one key aspect I need.

Goal(s)

Main goal I need to skip certain values in my ID column. The below code takes out the Dashes "-" and only reads up to 9 digits. Yet, I need to skip certain IDs because they are unique.

After that I'll start to work on comparing multiple sheets.

  • Main data frame IDs is formatted as 000-000-000-000
  • The other data frames that I will compare it to have it with no dashes "-" as 000000000 and three less 000's totaling nine digits.

The unique IDs that I need skipped are the same in both data frames, but are formatted completely different ranging from 000-000-000_#12, 000-000-000_35, or 000-000-000_z.

My code that I will use on each ID except the unique ones:

 dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]

but I want to use an if statement like (This does not work)

lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]

if ~dfSS["ID"].isin(lst ).any()
    dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
    pass

For more clarification my input DataFrame is this:

            ID               Street #   Street Name 
0   004-330-002-000         2272        Narnia  
1   021-521-410-000_128     2311        Narnia  
2   001-243-313-000         2235        Narnia  
3   002-730-032-000         2149        Narnia
4   000-000-000_a           1234        Narnia

And I am looking to do this as the output:

            ID               Street #   Street Name 
0   004330002               2272        Narnia  
1   021-521-410-000_128     2311        Narnia  
2   001243313000            2235        Narnia  
3   002730032000            2149        Narnia
4   000-000-000_a           1234        Narnia

Notes:

  • dfSS is my Dataframe variable name aka the excel I am using. "ID" is my column heading. Will make this an index after the fact
  • My Data frame on this job is small with # of (rows, columns) as (2500, 125)
  • I do not get an error message so I am guessing maybe I need a loop of some kind. Starting to test for loops with this as well. no luck there... yet.

Here is where I have been to research this:

JQTs
  • 142
  • 2
  • 11
  • you may want to check out `np.where`, it's makes doing conditional updates pretty easy – Chris Jun 10 '21 at 21:19
  • @Chris Dang I was putting off numpy learning till I could start actually compare my excel sheets. Guess I just need to bite that bullet a little. Will read more on that np.where tonight. Thank you. – JQTs Jun 10 '21 at 21:47

2 Answers2

1

There are a number of ways to do this. The first way here doesn't involve writing a function.

# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing

The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.

def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]

dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
xyzjayne
  • 1,331
  • 9
  • 25
  • @xyzjane Can you help me on understanding this line of code 'df.loc[~df["ID"].isin(uniqueID), "APN"]' ? specifically the left side of the comma '~df["ID"]'.isin part ? ID is not a row and after reading through the documentation I still can understand how that works. – JQTs Jun 17 '21 at 23:14
  • 1
    "ID" is a column and df['ID'] is retrieving that column. ``df['ID'].isin(lst)`` is a list of True/False indicating whether each element in the column is in the list called ``lst``. ``~`` is logical no, so ``~df['ID'].isin(lst)`` return True if it's not in the list and False otherwise. – xyzjayne Jun 18 '21 at 21:26
  • That makes so much sense. I knew I was missing some key concept. Thank you !! – JQTs Jun 18 '21 at 22:37
0

This is based on @xyzxyzjayne answers but I have two issues I can not figure out.

First issue

is I get this warning: (see Edit)

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Documentation for this warning

You'll see in the code below I tried to put in .loc but I can't seem to find out how to eliminate this warning by using .loc correctly. Still learning it. NO, I will not just ignore it even though it works. This is a learning opportunity I say.

Second issue

is that I do not under stand this part of the code. I know the left side is supposed to be rows, and the right side is columns. That said why does this work? ID is a column not a row when this code is rune. I make the ID :

df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]

Area I don't understand yet, is the left side of the comma(,) on this part:

df.loc[~df["ID "].isin(uniqueID), "ID "]

That said here is the final result, basically as I Said its XZY's help that got me here but I'm adding more .locs and playing with the documentation till I can eliminate the warning.

    uniqueID = [ and whole list of IDs i had to manually enter 1000+ entries that
 will go in the below code. These ids get skipped. example: "032-234-987_#4256"]

# gets the columns i need to make the DateFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]

#Place holder will make our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:,"ID "].str.replace("-", "").str[:9]

#the next code is the filter that goes through the list and skips them. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]

#Makes the ID our index
df = df.set_index("ID ")

#just here to add the date to our file name. Must import time for this to work
todaysDate = time.strftime("%m-%d-%y")

#make it an excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")

Will edit this once i get rid of the warning and figure out the left side so I can explain to for everyone who needs/sees this post.

Edit: SettingWithCopyWarning:

Fixed this chained index problem by making a copy of the orginal data base before filter and making everthing .loc as XYZ has helped me with. Before we start to filter use DataFrame.copy() where DataFrame is the name of your own dataframe.

JQTs
  • 142
  • 2
  • 11