
I wrote a Python script to scrape the physics topics from this page: https://web.archive.org/web/20160317132756/http://publishing.aip.org/publishing/pacs/pacs-reg00#01. When I run the script, it raises an error related to the regular expression used in re.findall(). The error is as follows:

raise source.error("missing ), unterminated subpattern",
re.error: missing ), unterminated subpattern at position 0
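
For reference, the same message can be reproduced in isolation. This is only a sketch under an assumption: that the text interpolated into the pattern contains an opening parenthesis. The helper name `compile_error` and the sample fragments are made up for illustration and are not taken from the page.

```python
import re

def compile_error(fragment):
    """Interpolate `fragment` unescaped into a pattern shaped like pat_3.

    Returns the re.error message if compilation fails, otherwise None.
    """
    pattern = rf'(^{fragment}\w[-\w]).*?'
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# A fragment made only of digits and dots compiles without complaint.
print(compile_error("01.30."))

# A fragment containing "(" opens a second group. The only ")" in the
# pattern closes that inner group, so the outer group that starts at
# position 0 is left unterminated.
print(compile_error("(see a"))
```

If this assumption holds, the error would come from the scraped text itself rather than from combining the r and f prefixes.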

I have tried to fix this error, but without success. I do not know how to handle the problem or how to modify the regex pattern correctly.

The Python script is as follows:

import requests
import re
from bs4 import BeautifulSoup

url = "https://web.archive.org/web/20160317132756/http://publishing.aip.org/publishing/pacs/pacs-reg00#01"

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

pair_1st = soup.find("tr") # pair_1st = 00. GENERAL

items = soup.find_all("tr", valign="TOP") #A big list[<tr>...,</tr>, ...]
with open("PACS_topic_ALL_Level_Clean.txt", "w", encoding="utf-8") as f:
    for item in items: 
        if item != pair_1st:
            pair_2nd = item.select('font[size]')
            if len(pair_2nd) > 1: 
                code_2nd = pair_2nd[0].text.strip()
                title_2nd = pair_2nd[1].text.strip()
                code_title_2nd = code_2nd +' ' + title_2nd 
                Code_2 = code_2nd[:2]
                pat_2 = rf'(^{Code_2}.\d\d.[+-]\w).*?'
                codes_3rd = re.findall(pat_2, soup.text, re.S|re.M)
                for code in codes_3rd: 
                    code_3rd_pseudo = soup.find(string=re.compile(r'{}'.format(code)))
                    code_3rd = code_3rd_pseudo.text.strip()
                    name_3rd = code_3rd_pseudo.find_next("b").text.strip()
                    code_name_3rd = code_3rd +' ' + name_3rd 
                    Code_3 = code_3rd[:6]
                    pat_3 = rf'(^{Code_3}\w[-\w]).*?'
                    codes_4th = re.findall(pat_3, soup.text, re.S|re.M)
                    f.write(f'{code_title_2nd}/{code_name_3rd}|{str(len(codes_4th))}\n')
                    if len(codes_4th) != 0:
                        for code in codes_4th: 
                            code_4th_pseudo = soup.find(string=re.compile(r'{}'.format(code)))
                            code_4th = code_4th_pseudo.text.strip()
                            title_4th = code_4th_pseudo.find_next().text.strip()
                            code_title_4th = code_4th + title_4th 
                            f.write(f"{code_title_4th}\n")

I want to use variables inside the regular expression, so I combine a raw string with an f-string (rf'...'). I expected this to work, but it produces the regex error described above.
