How to use regex in Python 3.7 to have 2 OR 3 groups?

Question

I have to extract the brand name, model, and sometimes the trim level of cars listed on a website. The problem is that when I put two groups in my regex, I cannot access the third element (the trim level of the car), and when I put three groups in my regex, I get nothing for cars without a trim level.

<a href="https://XXX.ir/car/bmw/x4">بی‌ام‌و ایکس ۴ </a>
<a href="https://XXX.ir/car/peugeot/405/glx">پژو ۴۰۵ جی‌ال‌ایکس</a>

my_regex_1 = r'https:\/\/XXX\.ir\/car\/(.+)\/(.+)\/(.+)'
my_regex_2 = r'https:\/\/XXX\.ir\/car\/(.+)\/(.+)\/'

My code:

import requests
from bs4 import BeautifulSoup
import re

mainpage = requests.get('https://bama.ir/')
soup = BeautifulSoup(mainpage.text, 'html.parser')
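# href=True skips anchors that have no href attribute (item['href'] would raise KeyError)
brands = soup.find_all('a', href=True)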
infos = []
for item in brands:
    link = item['href']
    info = re.findall(r'https:\/\/bama\.ir\/car\/([^\/]+?)\/([^\/]+?)(?:\/([^"]+))?', link)
    infos.append(info)
print(infos)

Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Zulfiqaar, Apr 01 '19 at 15:20
Actually I'm using BeautifulSoup, but I want to have both cases: 1. without trim level and 2. with trim level. How should I do it? — Mehdi Abbassi, Apr 01 '19 at 15:22
For the first link: [('bmw', 'x4')] and for the second one: [('peugeot', '405', 'glx')] – Mehdi Abbassi, Apr 01 '19 at 15:30
I'm sorry, I had a typo! I gave incorrect information about the problem! – Mehdi Abbassi, Apr 01 '19 at 15:31
So, you have a regex with 3 groups, the last one inside an optional non-capturing group. `re.findall` will always return a list of 3-item tuples; if you need to get rid of the empty value, you will need a list comprehension to rebuild the output. That is all there is to say about it. – Wiktor Stribiżew, Apr 01 '19 at 18:27
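
The list-comprehension idea from the last comment could look roughly like the sketch below. It assumes the links are already available as plain href strings; the `links` list and the `CAR_URL` name are made up for illustration, and the pattern is a simplified greedy variant rather than anyone's exact regex from this thread.

```python
import re

# Optional third group for the trim level; XXX.ir is the placeholder host from the question.
CAR_URL = re.compile(r'https://XXX\.ir/car/([^/]+)/([^/]+)(?:/([^/"]+))?')

# Hypothetical sample input: bare href values, one without and one with a trim level.
links = [
    'https://XXX.ir/car/bmw/x4',
    'https://XXX.ir/car/peugeot/405/glx',
]

infos = []
for link in links:
    for groups in CAR_URL.findall(link):
        # findall fills an unmatched optional group with '', so drop empty strings.
        infos.append(tuple(g for g in groups if g))

print(infos)
```

Each tuple then has two or three items depending on whether the link carries a trim level.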

Matt.G · Accepted Answer · 2019-04-01T20:25:33.813

2

Try Regex: https:\/\/XXX\.ir\/car\/([^\/]+?)\/([^\/]+?)(?:\/([^\"]+))?\"

Demo
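
A note on applying this to a bare href value (as returned by BeautifulSoup) rather than to raw HTML: the trailing `\"` then has nothing to match, and if it is simply dropped, the lazy `+?` groups have nothing forcing them to expand. The sketch below illustrates that behaviour and one possible alternative using greedy groups with `fullmatch`. The names `lazy`, `anchored`, and `car_parts` are hypothetical, and the optional trailing slash is an assumption about the site's URLs.

```python
import re

href = 'https://XXX.ir/car/peugeot/405/glx'  # bare href value, no surrounding quotes

# Lazy groups with nothing after them: group 2 stops after a single character.
lazy = re.compile(r'https://XXX\.ir/car/([^/]+?)/([^/]+?)(?:/([^"]+))?')
lazy_result = lazy.findall(href)

# Greedy groups plus fullmatch, so the whole href has to be consumed.
anchored = re.compile(r'https://XXX\.ir/car/([^/]+)/([^/]+)(?:/([^/]+))?/?')

def car_parts(href):
    """Return (brand, model) or (brand, model, trim); None if href is not a car link."""
    m = anchored.fullmatch(href)
    if m is None:
        return None
    # An unmatched optional group is None in m.groups(), so filter it out.
    return tuple(g for g in m.groups() if g is not None)

print(lazy_result)
print(car_parts('https://XXX.ir/car/bmw/x4'))
print(car_parts(href))
```

Returning `None` for non-matching links also makes it easy to skip every anchor that is not a car page.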

edited Apr 01 '19 at 20:25

answered Apr 01 '19 at 15:49

Matt.G


gives nothing! Just an empty list. – Mehdi Abbassi Apr 01 '19 at 15:55
As you can see in the demo, the capture groups have the values that you are looking for. Show us your Python code. – Matt.G Apr 01 '19 at 15:57
In the link you provided, by changing FLAVOR to Python, the regex stops detecting anything. – Mehdi Abbassi Apr 01 '19 at 20:24
@MehdiAbbassi, I've updated the post to work with Python – Matt.G Apr 01 '19 at 20:25

score 1 · Answer 2 · answered Apr 01 '19 at 15:27

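One option here would be to use urlparse (from urllib.parse in Python 3; it was the urlparse module in Python 2) to split the URL path, and avoid relying on regex capture groups for the path components:

import re
from urllib.parse import urlparse  # Python 3 location of urlparse

html = "<a href=\"https://XXX.ir/car/bmw/x4/lx\">بی‌ام‌و ایکس ۴ ال‌ایکس</a>"
# Pull the URL out of the anchor tag, then split its path into components
url = re.sub(r'.*(https?://[^"]+).*', '\\1', html)
path = urlparse(url).path
parts = path[1:].split('/')
print(parts)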

['car', 'bmw', 'x4', 'lx']

With a list of path components in hand, you may simply iterate it as many times as is needed.
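
To keep only the car links and get two or three components per link, the same idea could be wrapped in a small helper. This is a sketch under assumptions: `car_path_parts` and the sample `hrefs` list are hypothetical, the hrefs are assumed to have been collected already (for example from BeautifulSoup anchors), and the host is not checked.

```python
from urllib.parse import urlparse

def car_path_parts(href):
    """Return the path components after 'car', or None for non-car links."""
    # Split the path and drop empty pieces caused by leading or trailing slashes.
    parts = [p for p in urlparse(href).path.split('/') if p]
    # Expect car/<brand>/<model> or car/<brand>/<model>/<trim>.
    if len(parts) in (3, 4) and parts[0] == 'car':
        return tuple(parts[1:])
    return None

# Hypothetical sample hrefs: two car links and one unrelated link.
hrefs = [
    'https://XXX.ir/car/bmw/x4',
    'https://XXX.ir/car/peugeot/405/glx',
    'https://XXX.ir/news/today',
]

infos = [parts for parts in map(car_path_parts, hrefs) if parts is not None]
print(infos)
```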

answered Apr 01 '19 at 15:27

Tim Biegeleisen


I'm sorry, but I edited my question and now it is what I really wanted to ask! – Mehdi Abbassi Apr 01 '19 at 15:32
@MehdiAbbassi Did you bother to read/test my answer? It gets around the regex capture group problem by instead just putting all available path components into a single level list. – Tim Biegeleisen Apr 01 '19 at 15:39
I'm sorry, and I appreciate your help. But I have trouble using your code since I'm using Python 3, and another point is that I only want links like "http://XXX.ir/car/..." and not all links! – Mehdi Abbassi Apr 01 '19 at 15:48
I have heard that soup can be beautiful. – Tim Biegeleisen Apr 01 '19 at 15:52
Actually I'm using BeautifulSoup. But I don't know how to use this great package to solve my problem. – Mehdi Abbassi Apr 01 '19 at 15:54
