You are using `\s`, which matches any kind of whitespace, both vertical and horizontal. If you plan to match only spaces and tabs, replace it with `[ \t]`.
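To see the difference, here is a small sketch. The sample string and the shortened `sqft`-only pattern are made up for illustration and are not taken from your data:

```python
import re

# Hypothetical sample: a phone number ends one line, a unit word starts the next.
text = "call 9920902674\nsqft available"

# \s also matches the line break, so the match runs across two lines.
with_s = re.findall(r'\d[\d\s,.]*sqft', text)
print(with_s)        # ['9920902674\nsqft']

# [ \t] matches only spaces and tabs, so the match cannot cross the line break.
with_class = re.findall(r'\d[\d \t,.]*sqft', text)
print(with_class)    # []
```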
Besides, consider escaping the dots in your pattern (they are all outside character classes) so that they match literal dots; otherwise, each of them matches any character except a line break.
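A quick illustration of what an unescaped dot lets through. The text is a made-up example, and `sqxft` is a deliberately bogus unit:

```python
import re

# Hypothetical sample text; "sqxft" is not a real unit.
text = "area 1200 sq.ft and 900 sqxft"

# Unescaped dot: "." accepts any character, so "sqxft" slips through.
unescaped = re.findall(r'\d+ sq.ft', text)
print(unescaped)   # ['1200 sq.ft', '900 sqxft']

# Escaped dot: only a literal "." is accepted.
escaped = re.findall(r'\d+ sq\.ft', text)
print(escaped)     # ['1200 sq.ft']
```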
Also, you do not need a capturing group around the whole pattern: the whole match is always available as Group 0, which you can read with `m.group(0)` (or just `m.group()`) on each match object returned by `re.finditer`.
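For instance, with a shortened two-unit pattern and a made-up string (both are assumptions used only for illustration), Group 0 gives the full match without any extra parentheses:

```python
import re

# Hypothetical sample text and a cut-down unit list.
text = "1350 sqft and 200 yard"

whole_matches = []
for m in re.finditer(r'\d+ (?:sqft|yard)', text):
    # group(0) and group() both return the entire match.
    assert m.group(0) == m.group()
    whole_matches.append(m.group(0))

print(whole_matches)   # ['1350 sqft', '200 yard']
```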
So, you may use
\d[\d \t,.]*(?:carpet|sft|feet|sqft|yard|gaj|feet|s\.ft|sq\.ft|sq feet|fq\.ft\.|sq\.ft\.|pt|crpt|ft|sq\.mt\.|sq\.mtr|sq\.mt|plot|sf|sfqt|acer|gj|vigha|anna|gunta|sq|gunthe|guntha|bigha|sqd|sqm|sqyd|area|acre|square|yrd|sq\.yard|sq yd|sq\.yd|sq\. yd\.|gaj|sqt)s?
See the regex demo.
You may use `re.findall(pattern, s)` to get all matches as a list. Or, if you need a list of tuples containing specific submatches, wrap those parts with capturing parentheses. E.g., to capture the number into one group and the measurement unit into another, use `(\d(?:[\d ,.]*\d)?)[ \t]*((?:carpet|sft|feet|sqft|yard|gaj|feet|s\.ft|sq\.ft|sq feet|fq\.ft\.|sq\.ft\.|pt|crpt|ft|sq\.mt\.|sq\.mtr|sq\.mt|plot|sf|sfqt|acer|gj|vigha|anna|gunta|sq|gunthe|guntha|bigha|sqd|sqm|sqyd|area|acre|square|yrd|sq\.yard|sq yd|sq\.yd|sq\. yd\.|gaj|sqt)s?)`. Note that I revamped `\d(?:[\d \t,.]*\d)? *` into `(\d(?:[\d ,.]*\d)?)[ \t]*` to make sure the spaces after the number are not captured.
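A minimal sketch of how `re.findall` changes its return value once capturing groups are present. It uses a cut-down unit list and a made-up string, not the full pattern above:

```python
import re

# Hypothetical sample text with two measurements.
text = "1350 sqft, 2 acre"

# No capturing groups: a list of whole-match strings.
no_groups = re.findall(r'\d+ (?:sqft|acre)', text)
print(no_groups)     # ['1350 sqft', '2 acre']

# Two capturing groups: a list of (number, unit) tuples.
with_groups = re.findall(r'(\d+) (sqft|acre)', text)
print(with_groups)   # [('1350', 'sqft'), ('2', 'acre')]
```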
Python demo:
import re
s = "kanakiya area 1350 sqft asking price : 95 lacs destination properties azymn - 9920902674 \n plot on rent near sp ring road rajpath club ki gali me road touch 5000 war na 350000 rent owner side no b"
pattern = r'\d[\d ,.]*(?:carpet|sft|feet|sqft|yard|gaj|feet|s\.ft|sq\.ft|sq feet|fq\.ft\.|sq\.ft\.|pt|crpt|ft|sq\.mt\.|sq\.mtr|sq\.mt|plot|sf|sfqt|acer|gj|vigha|anna|gunta|sq|gunthe|guntha|bigha|sqd|sqm|sqyd|area|acre|square|yrd|sq\.yard|sq yd|sq\.yd|sq\. yd\.|gaj|sqt)s?'
print(re.findall(pattern, s))
pattern1 = r'(\d(?:[\d ,.]*\d)?)[ \t]*((?:carpet|sft|feet|sqft|yard|gaj|feet|s\.ft|sq\.ft|sq feet|fq\.ft\.|sq\.ft\.|pt|crpt|ft|sq\.mt\.|sq\.mtr|sq\.mt|plot|sf|sfqt|acer|gj|vigha|anna|gunta|sq|gunthe|guntha|bigha|sqd|sqm|sqyd|area|acre|square|yrd|sq\.yard|sq yd|sq\.yd|sq\. yd\.|gaj|sqt)s?)'
print("Now, with captures:")
for m in re.finditer(pattern1, s):
    print("{} => {}".format(m.group(1), m.group(2)))
Output:
['1350 sqft']
Now, with captures:
1350 => sqft