Code:

import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin
import json
from os import listdir


res = requests.get('http://www.abcde.com/frontend/SearchParts')
soup = BeautifulSoup(res.text,"lxml")
href = [ a["href"] for a in soup.findAll("a", {"id" : re.compile("parts_img.*")})] 
b1 =[]
for url in href:
    b1.append("http://www.abcde.com"+url)
#print (b1)  
b=[]
for i in range(len(b1)):
    res2 = requests.get(b1[i]).text
    soup2 = BeautifulSoup(res2,"lxml")
    url_n=soup2.find('',rel = 'next')['href']  
    url_n=("http://www.abcde.com"+url_n)
    #print(url_n)  

    b.append(b1[i]) 
    b.append(url_n)  
    while True:   
        res3=requests.get(url_n).text
        soup3 = BeautifulSoup(res3,"lxml")
        try:
            url_n=soup3.find('',rel = 'next')['href']  
        except TypeError:   
            break
        if url_n:
            url_n=("http://www.abcde.com"+url_n)
            #print(url_n)
            b.append(url_n)     
all=[]
for url in b:
    res = requests.get(url)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".article-title"):
        all.append(urljoin('http://www.abcde.com',item['href']))  
for urls in all:
    # use a name other than "re" so the imported re module is not shadowed
    res=requests.get(urls)
    soup=BeautifulSoup(res.text.encode('utf-8'), "html.parser")
    title_tag = soup.select_one('.page_article_title')
    list=[]
    for tag in soup.select('.page_article_content'):
        list.append(tag.text)
    list=([c.replace('\n', '') for c in list])
    list=([c.replace('\r', '') for c in list])
    list=([c.replace('\t', '') for c in list])
    list=([c.replace(u'\xa0', u' ') for c in list])
    list= (', '.join(list))   
    fruit_tag = soup.select_one('.authorlink')
    fruit_final=None
    if fruit_tag:
        fruit_final= fruit_tag.text
    else:
        fruit_final= fruit_tag
    keys=soup.findAll('div', style="font-size:1.2em;")
    keys_final=None
    list2=[]
    if keys:
        for key in keys:
            list2.append(key.text)
        list2=([c.replace('\n', '') for c in list2])
        list2=([c.replace(' ', '') for c in list2])
        list2= (', '.join(list2))
        key_final=list2
    else:
        key_final=keys
        if key_final==[]:
            key_final=None

##################edit part####################################
data={
    "Title" : title_tag.text,
    "Registration": fruit_final,
    "Keywords": key_final,
    "Article": list
}


save_path= "C:/json/"   
files=listdir(save_path)   # bare file names, without the directory part
file_name = '%s.json' % title_tag.text
# decide on the name before opening, otherwise the original file is overwritten
if file_name in files:
    file_name = '%s_1.json' % title_tag.text
with open(save_path+file_name, 'w',encoding='UTF-8') as f:
    file_d = json.dumps(data,ensure_ascii=False)   
    f.write(file_d)

I scraped a web page and extracted every article's title as title_tag.text. I found that some articles have the same title but different URLs and contents, so I still need to save all of them in my directory. I know how to handle two articles with the same title: I can name one file original and the other original_1. But what if I need to save 4 files that have the same title? How can I do it in this case? Thanks in advance!

Makiyo
  • You could check whether that file name already exists and, if so, add a `(1)` to the end of the file name you are about to create, incrementing the number each time a match is found. You basically need to read the directory before you write, gather the data about the files in it, and make a decision based on that. If you find 4 `TitleName.json` files, with or without the suffix, then you name the 5th file you're about to create `TitleName(5).json` – Goralight Oct 18 '17 at 09:35
  • Check this Answer: https://stackoverflow.com/questions/13852700/python-create-file-but-if-name-exists-add-number – Goralight Oct 18 '17 at 09:40
  • @Goralight Could you please be more specific? No offense! I am new to Python and only started learning it less than 2 weeks ago, so sometimes I cannot fully understand what people say. I just tried to make the code as simple as possible, even if it's not that efficient. – Makiyo Oct 18 '17 at 10:01
  • First, it's confusing without sample input and output, so please provide a little more information. As a possible solution, you can try something like this: https://stackoverflow.com/a/46742446/5904928 – Aaditya Ura Oct 18 '17 at 10:12
  • @AyodhyankitPaul Thank you! Problem solved :) I actually checked this one, https://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-using-python, which was easier for me to understand – Makiyo Oct 18 '17 at 15:52
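
The approach suggested in the comments (check whether the name exists, and keep incrementing a number until it does not) could be sketched roughly as follows. This is only a minimal illustration, not the code the asker ended up with: the helper names `unique_json_path` and `save_article` are made up for this example, and the `_1`, `_2`, ... suffix style is assumed from the question rather than the `(1)` style mentioned in the comment.

```python
import json
import os


def unique_json_path(save_dir, title):
    """Return a path inside save_dir for this title that does not exist yet.

    Tries title.json first, then title_1.json, title_2.json, and so on.
    (Hypothetical helper, named for this example only.)
    """
    candidate = os.path.join(save_dir, "%s.json" % title)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(save_dir, "%s_%d.json" % (title, counter))
        counter += 1
    return candidate


def save_article(save_dir, data):
    """Write data as JSON under a free file name derived from data["Title"]."""
    file_name = unique_json_path(save_dir, data["Title"])
    with open(file_name, "w", encoding="UTF-8") as f:
        f.write(json.dumps(data, ensure_ascii=False))
    return file_name
```

With something like this, the `listdir` check in the question would be replaced by a call such as `save_article("C:/json/", data)` for each article, and four articles sharing a title would each get their own file.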
