
I have to extract the brand name, model, and sometimes the trim level of cars listed on a website. The problem is that when I put two groups in my regex, I cannot capture the third element (the trim level of the car), and when I put three groups in my regex, I get nothing for cars without a trim level.

<a href="https://XXX.ir/car/bmw/x4">بی‌ام‌و ایکس ۴ </a>
<a href="https://XXX.ir/car/peugeot/405/glx">پژو ۴۰۵ جی‌ال‌ایکس</a>

my_regex_1 = r'https:\/\/XXX\.ir\/car\/(.+)\/(.+)\/(.+)'
my_regex_2 = r'https:\/\/XXX\.ir\/car\/(.+)\/(.+)\/'

My code:

import requests
from bs4 import BeautifulSoup
import re

mainpage = requests.get('https://bama.ir/')
soup = BeautifulSoup(mainpage.text, 'html.parser')
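brands = soup.find_all('a', href=True)  # skip anchors that have no href
infos = []
for item in brands:
    link = item['href']
    # Greedy groups: a lazy ([^/]+?) with nothing after it would capture only one character
    info = re.findall(r'https://bama\.ir/car/([^/]+)/([^/]+)(?:/([^/"]+))?', link)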
    infos.append(info)
print(infos)
Mehdi Abbassi
  • I would recommend using BeautifulSoup – Zulfiqaar Apr 01 '19 at 15:20
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Zulfiqaar Apr 01 '19 at 15:20
  • Actually I'm using BeautifulSoup, but I want to have both cases: 1. without trim level and 2. with trim level. How should I do it? – Mehdi Abbassi Apr 01 '19 at 15:22
  • What are your expected matches? – Zulfiqaar Apr 01 '19 at 15:25
  • For the first link: [('bmw', 'x4')] and for the second one: [('peugeot', '405', 'glx')] – Mehdi Abbassi Apr 01 '19 at 15:30
  • I'm sorry, I had a typo! I gave incorrect information about the problem! – Mehdi Abbassi Apr 01 '19 at 15:31
  • So, you have a regex with 3 groups, with the last one inside an optional non-capturing group. `re.findall` will always return a list of 3-item tuples. If you need to get rid of the empty value, you will need a list comprehension to rebuild the output. That is all there is to say about it. – Wiktor Stribiżew Apr 01 '19 at 18:27

2 Answers


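Try this regex: `https:\/\/XXX\.ir\/car\/([^\/]+?)\/([^\/]+?)(?:\/([^\"]+))?\"`

The third group sits inside an optional non-capturing group, so the pattern matches links both with and without a trim level. The closing `\"` anchors the match at the end of the `href` attribute, which means the pattern is meant to run against the HTML text rather than the bare URL.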
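If the pattern is applied to the bare `href` value instead (as BeautifulSoup returns it), there is no closing quote to anchor on. A minimal sketch under that assumption: the groups are made greedy, the match is anchored with `re.fullmatch`, and unmatched groups are dropped. The names `CAR_URL` and `extract_car_info` are hypothetical, not from the original post.

```python
import re

# Assumption: the input is the bare href value, so the trailing `"` is dropped
# and the whole string must match (fullmatch) instead.
CAR_URL = re.compile(r'https://XXX\.ir/car/([^/]+)/([^/]+)(?:/([^/]+))?/?')

def extract_car_info(href):
    """Return (brand, model) or (brand, model, trim); None if href is not a car link."""
    match = CAR_URL.fullmatch(href)
    if match is None:
        return None
    # An optional group that did not take part in the match is None; drop it.
    return tuple(g for g in match.groups() if g is not None)

links = [
    "https://XXX.ir/car/bmw/x4",
    "https://XXX.ir/car/peugeot/405/glx",
]
for link in links:
    print(extract_car_info(link))
```

This should give a 2-tuple for links without a trim level and a 3-tuple for links with one, which is the shape asked for in the comments.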


Matt.G

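One option is to parse the URL with `urlparse` (`urllib.parse.urlparse` in Python 3) instead of matching the path components with a regex:

import re
from urllib.parse import urlparse  # in Python 2: import urlparse

html = "<a href=\"https://XXX.ir/car/bmw/x4/lx\">بی‌ام‌و ایکس ۴ ال‌ایکس</a>"
# Pull the URL out of the tag, then keep only its path (/car/bmw/x4/lx)
url = re.sub(r'.*(https?://[^"]+).*', '\\1', html)
path = urlparse(url).path
# Drop the leading slash and split into components
parts = path[1:].split('/')
print(parts)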

['car', 'bmw', 'x4', 'lx']

With the list of path components in hand, you can take as many of them as you need: brand, model, and, when present, trim level.
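Since the question already gets each `href` from BeautifulSoup, the same idea can be sketched without any regex. This assumes paths of the form `/car/<brand>/<model>[/<trim>]`; the helper name `car_parts` is hypothetical.

```python
from urllib.parse import urlparse

def car_parts(url):
    """Return the path components after the leading 'car' segment, or None."""
    # Assumption: car pages live under /car/<brand>/<model>[/<trim>].
    segments = [s for s in urlparse(url).path.split('/') if s]
    if not segments or segments[0] != 'car':
        return None
    return tuple(segments[1:])

print(car_parts("https://XXX.ir/car/bmw/x4"))
print(car_parts("https://XXX.ir/car/peugeot/405/glx"))
```

Filtering out empty segments also makes a trailing slash harmless, and the tuple length tells you whether a trim level was present.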

Tim Biegeleisen