
I am trying the following regex on the text below.

Regex:

(\d+[\s\,\d.]*\s*(carpet|sft|feet|sqft|yard|gaj|feet|s.ft|sq.ft|sq feet|fq.ft.|sq.ft.
|pt|crpt|ft|sq.mt.|sq.mtr|sq.mt|plot|sf|sfqt|acer|gj|vigha|anna|gunta|sq|
gunthe|guntha|bigha|sqd|sqm|sqyd|area|acre|square|yrd|
sq.yard|sq yd|sq.yd|sq. yd.|gaj|sqt)s?)

Input text:

kanakiya area 1350     sqft asking price : 95 lacs destination properties azymn - 9920902674 
 plot on rent near sp ring road rajpath club ki gali me road touch 5000 war na 350000 rent owner side no b

It matches all the required strings correctly, but it also matches 9920902674
plot

I don't want to match words in the text that are separated by a newline.

You can compile the regex above to see the problem. How can I keep a match from spanning a newline? I only want to match words that have spaces between them.

Thanks

PS: I have rewritten this question because the previous version was not well received and my account was closed. I am trying to improve my questions to unlock the account.

Please ignore the previous answers and comments.

iamabhaykmr
  • I'd advise to split the regex into 2 alternatives, `\s*()|()\s*`. Something like [`(\d[. \d\t]*)(?:pkg\b|k\b|lac\.|lakh\.|crore\.|cr\.|l\b)|\b(?:rent|rs)\.\s*(\d[. \d\t]*)`](https://regex101.com/r/xsDcQ9/1). See [this Python demo, too](https://ideone.com/Dpt0BE). – Wiktor Stribiżew Aug 21 '18 at 12:08
  • Try https://regex101.com/r/ziAOMw/3 – revo Aug 21 '18 at 12:12
  • You might not need regex. Here is a better way: (1) create dict with all currency types (2) split the input text and look to the left of currency types. – rodcoelho Aug 21 '18 at 12:13
  • Based on what you really want you could go with `(rent|rs)?([\s.]*\d+[\s\d.]*)(pkg|k|(?:la(?:c|kh)|crore|cr)s?|l)` too. See live demo here https://regex101.com/r/ziAOMw/4 – revo Aug 21 '18 at 12:23
  • Thanks all . It works great. – iamabhaykmr Aug 21 '18 at 12:28
  • The only problem is that it also matches the spaces on the left and right, which causes a problem in the next step of my project. Is it possible not to match the leading and trailing spaces? @WiktorStribiżew – iamabhaykmr Aug 21 '18 at 12:29
  • Got it working . Thanks again. – iamabhaykmr Aug 21 '18 at 12:35
  • Are those unescaped `.` in the pattern of yours meant to match any char? I understood those were some abbreviations. – Wiktor Stribiżew Aug 21 '18 at 13:04
  • Does this help: https://stackoverflow.com/a/37571199/2064981 – SamWhan Aug 21 '18 at 13:37

1 Answer

You are using \s, which matches any whitespace, vertical as well as horizontal, so it also matches line breaks. If you only want to match spaces and tabs, replace it with [ \t].
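A minimal sketch of the difference, using a shortened pattern with a single unit (`plot`) rather than the full alternation from the question; the variable names are only illustrative:

```python
import re

# The tail of the first input line and the head of the second one.
text = "9920902674 \n plot"

# \s* is allowed to cross the line break, so the number and "plot" are joined.
with_s = re.findall(r"\d+\s*plot", text)

# [ \t]* only consumes spaces and tabs, so the match cannot span two lines.
with_space_tab = re.findall(r"\d+[ \t]*plot", text)

print(with_s)          # ['9920902674 \n plot']
print(with_space_tab)  # []
```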

Besides, you should escape the dots in your pattern (they are all outside character classes) so that they match literal dots; otherwise, each dot matches any character except a line break.
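To illustrate, here is a small sketch with a made-up sample string (`sqxft` is not a real unit; it is only there to show the over-matching):

```python
import re

sample = "1200 sq.ft and 900 sqxft"

# Unescaped dot: "." matches any character, so "sqxft" is accepted too.
unescaped = re.findall(r"\d+ sq.ft", sample)

# Escaped dot: only a literal "." is accepted between "sq" and "ft".
escaped = re.findall(r"\d+ sq\.ft", sample)

print(unescaped)  # ['1200 sq.ft', '900 sqxft']
print(escaped)    # ['1200 sq.ft']
```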

Also, you do not need a capturing group around the whole pattern: the whole match is always available as Group 0 (for example, via m.group(0) when iterating over the match objects returned by re.finditer).
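A short sketch of reading Group 0, again with a reduced two-unit pattern and an invented sample sentence rather than the full pattern:

```python
import re

text = "flat of 1350 sqft and a plot of 200 gaj"

# No capturing group around the whole pattern; (?:...) only groups the units.
pattern = r"\d+[ \t]*(?:sqft|gaj)"

# m.group(0) is the full text of each match.
whole_matches = [m.group(0) for m in re.finditer(pattern, text)]

# With no capturing groups in the pattern, re.findall returns the same strings.
same_via_findall = re.findall(pattern, text)

print(whole_matches)     # ['1350 sqft', '200 gaj']
print(same_via_findall)  # ['1350 sqft', '200 gaj']
```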

So, you may use

\d[\d \t,.]*(?:carpet|sft|feet|sqft|yard|gaj|feet|s\.ft|sq\.ft|sq feet|fq\.ft\.|sq\.ft\.|pt|crpt|ft|sq\.mt\.|sq\.mtr|sq\.mt|plot|sf|sfqt|acer|gj|vigha|anna|gunta|sq|gunthe|guntha|bigha|sqd|sqm|sqyd|area|acre|square|yrd|sq\.yard|sq yd|sq\.yd|sq\. yd\.|gaj|sqt)s?

See the regex demo.

You may use re.findall(pattern, s) to get all matches as a list. Or, if you need a list of tuples containing specific submatches, wrap those parts in capturing parentheses. E.g., to capture the number into one group and the measurement unit into another, use (\d(?:[\d ,.]*\d)?)[ \t]*((?:carpet|sft|feet|sqft|yard|gaj|feet|s\.ft|sq\.ft|sq feet|fq\.ft\.|sq\.ft\.|pt|crpt|ft|sq\.mt\.|sq\.mtr|sq\.mt|plot|sf|sfqt|acer|gj|vigha|anna|gunta|sq|gunthe|guntha|bigha|sqd|sqm|sqyd|area|acre|square|yrd|sq\.yard|sq yd|sq\.yd|sq\. yd\.|gaj|sqt)s?). Note that I changed the number part from \d[\d \t,.]* to (\d(?:[\d ,.]*\d)?)[ \t]* so that the group must end with a digit and the spaces after the number are not captured.

Python demo:

import re
s = "kanakiya area 1350     sqft asking price : 95 lacs destination properties azymn - 9920902674 \n plot on rent near sp ring road rajpath club ki gali me road touch 5000 war na 350000 rent owner side no b"
pattern = r'\d[\d \t,.]*(?:carpet|sft|feet|sqft|yard|gaj|feet|s\.ft|sq\.ft|sq feet|fq\.ft\.|sq\.ft\.|pt|crpt|ft|sq\.mt\.|sq\.mtr|sq\.mt|plot|sf|sfqt|acer|gj|vigha|anna|gunta|sq|gunthe|guntha|bigha|sqd|sqm|sqyd|area|acre|square|yrd|sq\.yard|sq yd|sq\.yd|sq\. yd\.|gaj|sqt)s?'
print(re.findall(pattern, s))
pattern1 = r'(\d(?:[\d ,.]*\d)?)[ \t]*((?:carpet|sft|feet|sqft|yard|gaj|feet|s\.ft|sq\.ft|sq feet|fq\.ft\.|sq\.ft\.|pt|crpt|ft|sq\.mt\.|sq\.mtr|sq\.mt|plot|sf|sfqt|acer|gj|vigha|anna|gunta|sq|gunthe|guntha|bigha|sqd|sqm|sqyd|area|acre|square|yrd|sq\.yard|sq yd|sq\.yd|sq\. yd\.|gaj|sqt)s?)'
print("Now, with captures:")
for m in re.finditer(pattern1, s):
    print("{} => {}".format(m.group(1), m.group(2)))

Output:

['1350     sqft']
Now, with captures:
1350 => sqft
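As an optional variation, not part of the demo above, the long alternation could be generated from a plain list so that the dots are escaped automatically with re.escape. This is only a sketch: `UNITS` is a shortened, hypothetical subset of the units in the question, and `build_pattern` is an illustrative helper name, not an existing function.

```python
import re

# Hypothetical, shortened unit list; extend it with the remaining units as needed.
UNITS = ["sqft", "sq.ft", "sq feet", "sq. yd.", "sq", "gaj", "acre", "plot"]

def build_pattern(units):
    # Try longer units first so that "sqft" is attempted before "sq".
    ordered = sorted(units, key=len, reverse=True)
    # re.escape turns "." into "\." so the dots match literally.
    alternatives = "|".join(re.escape(u) for u in ordered)
    # Group 1: the number; group 2: the unit with an optional plural "s".
    return re.compile(r"(\d(?:[\d ,.]*\d)?)[ \t]*((?:" + alternatives + r")s?)")

pattern = build_pattern(UNITS)
s = "area 1350     sqft price 95 lacs 9920902674 \n plot, 2,500 sq. yd. and 3 acres"
pairs = pattern.findall(s)
print(pairs)  # [('1350', 'sqft'), ('2,500', 'sq. yd.'), ('3', 'acres')]
```

Sorting by length is a design choice of this sketch: it reduces the chance that a short unit such as `sq` wins over a longer one that starts with the same letters.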
Wiktor Stribiżew