I wrote a Python script to scrape the physics items from this website: https://web.archive.org/web/20160317132756/http://publishing.aip.org/publishing/pacs/pacs-reg00#01
When I run the script, I get an error related to the regular expression passed to re.findall().
The error is as follows:
raise source.error("missing ), unterminated subpattern",
re.error: missing ), unterminated subpattern at position 0
I tried to fix this error, but all my attempts failed. I do not know how to modify the regex pattern so that it compiles correctly.
The Python script is as follows:
import requests
import re
from bs4 import BeautifulSoup

url = "https://web.archive.org/web/20160317132756/http://publishing.aip.org/publishing/pacs/pacs-reg00#01"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
pair_1st = soup.find("tr")  # pair_1st = 00. GENERAL
items = soup.find_all("tr", valign="TOP")  # a big list [<tr>...</tr>, ...]
with open("PACS_topic_ALL_Level_Clean.txt", "w", encoding="utf-8") as f:
    for item in items:
        if item != pair_1st:
            pair_2nd = item.select('font[size]')
            if len(pair_2nd) > 1:
                code_2nd = pair_2nd[0].text.strip()
                title_2nd = pair_2nd[1].text.strip()
                code_title_2nd = code_2nd + ' ' + title_2nd
                Code_2 = code_2nd[:2]
                pat_2 = rf'(^{Code_2}.\d\d.[+-]\w).*?'
                codes_3rd = re.findall(pat_2, soup.text, re.S|re.M)
                for code in codes_3rd:
                    code_3rd_pseudo = soup.find(string=re.compile(r'{}'.format(code)))
                    code_3rd = code_3rd_pseudo.text.strip()
                    name_3rd = code_3rd_pseudo.find_next("b").text.strip()
                    code_name_3rd = code_3rd + ' ' + name_3rd
                    Code_3 = code_3rd[:6]
                    pat_3 = rf'(^{Code_3}\w[-\w]).*?'
                    codes_4th = re.findall(pat_3, soup.text, re.S|re.M)
                    f.write(f'{code_title_2nd}/{code_name_3rd}|{str(len(codes_4th))}\n')
                    if len(codes_4th) != 0:
                        for code in codes_4th:
                            code_4th_pseudo = soup.find(string=re.compile(r'{}'.format(code)))
                            code_4th = code_4th_pseudo.text.strip()
                            title_4th = code_4th_pseudo.find_next().text.strip()
                            code_title_4th = code_4th + title_4th
                            f.write(f"{code_title_4th}\n")
I want to interpolate variables into the regular expression, so I combine a raw string with an f-string (the rf'...' prefix). I expected this to work, but it produces the unexpected regex error described above.
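One possible cause, offered as a sketch rather than a confirmed diagnosis: the rf'...' prefix itself is fine, but any text interpolated into the pattern is parsed as regex syntax. If an interpolated value happens to contain a metacharacter such as "(", the pattern gains an opening parenthesis that is never closed. The snippet below reproduces that situation with a hypothetical prefix "(a" and shows re.escape() as one way to treat interpolated text literally. The helper find_codes, the sample text, and the choice to write the dots as literal \. are my assumptions, not taken from the page.

```python
import re

def find_codes(prefix: str, text: str) -> list[str]:
    """Return codes shaped like 'NN.dd.+x' whose start equals the literal prefix.

    re.escape() turns regex metacharacters in `prefix` into literals,
    so a value such as "(a" can no longer open an unclosed group.
    """
    pattern = rf'^({re.escape(prefix)}\.\d\d\.[+-]\w)'
    return re.findall(pattern, text, re.M)

# Hypothetical interpolated value that contains a regex metacharacter.
bad_prefix = "(a"

# Same pattern shape as pat_2 in the script, without escaping.
try:
    re.compile(rf'(^{bad_prefix}.\d\d.[+-]\w).*?')
    message = None
except re.error as exc:
    message = str(exc)  # the extra "(" leaves the outer group unclosed

# Made-up sample lines in the shape of the codes on the page.
sample_text = "01.30.-y first topic\n01.40.-d second topic\n02.10.-v third topic\n"

codes = find_codes("01", sample_text)            # normal two-digit prefix
no_codes = find_codes(bad_prefix, sample_text)   # compiles, matches nothing
```

The same concern would apply to the re.compile(r'{}'.format(code)) calls in the script: wrapping the value as re.compile(re.escape(code)) would keep characters such as "." and "+" in a code from being read as regex operators. Printing the interpolated values just before each pattern is built is a quick way to check which one actually triggers the error.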