0

I have a table where one column is the ID and the second is the unit number. I need group by ID to extract max floor or max stories in that ID and create an extra column max_floor so that all rows for that ID have the same value of max_floor.

Case 1: As per the unit number, it can start with 1 digit followed by letters or 2 digits followed by letters. In that case, extract digits and find the maximum number to get the floors, as shown in the table below:

enter image description here

Case 2: The unit number starts with 1 or 2 letters, followed by numbers. In that case, the number of unique values of those letters give us the maximum number of stories,as given in the table:

enter image description here

First I need to filter out these cases and then calculate the floors. Need help with both

Update: I'm trying filtering using regex for case 1. This is my code:

if re.search(r'^[0-9]{1,2}[A-Z]+$',df.number.fillna('').astype(str)):
   print(df.number)```

But I'm getting the error: 
```TypeError: expected string or bytes-like object```
  • Please don't post images of code, data, or Tracebacks. Copy and paste it as text then format it as code (select it and type `ctrl-k`) ... [Discourage screenshots of code and/or errors](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors)...[Why not upload images of code on SO when asking a question?](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question) ... [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – wwii Mar 16 '21 at 19:19
  • `I need to...` - which part are you having trouble with? – wwii Mar 16 '21 at 19:20
  • This is a specific use case, but I think you will be able to sort it out yourself if you search for regex string extract and for "loc" of pandas dataframe. Basically, I would use regex to extract the relevant digits or string, then use loc to filter it as needed. – born_naked Mar 16 '21 at 19:23
  • @wwii both parts – beginnner_python Mar 16 '21 at 19:24
  • https://pandas.pydata.org/docs/user_guide/indexing.html – wwii Mar 16 '21 at 19:35
  • @born_naked can you help with the regex? – beginnner_python Mar 16 '21 at 19:49
  • Using re.match, you could use this pattern to get both `(?:\d{1,2}(?=[A-Z]+$)|[A-Z]{1,2}(?=\d+))` https://regex101.com/r/iDxXpN/1 or use them separately like this to get only digits for example `\d{1,2}(?=[A-Z]+$)` https://regex101.com/r/lDl5Uk/1 – The fourth bird Mar 16 '21 at 21:38

1 Answers1

0

This code snippet defines a function that returns the all the digits found in the input string. As long as you don't input stings that don't have digits it should work, and problems with that can be mitigated using a simple try/except statement.

import re
def extract_digits(input_string):
    return re.findall('\d{1,2}')[0]

The code uses python's regex module to look for exactly 1 or 2 digits in the string. Since the findall() function returns a list, the code accesses the first index because based on the input we expect, the first index of the returned list should always be the matched digits.

Dharman
  • 30,962
  • 25
  • 85
  • 135
Jack H.
  • 115
  • 10