I'm having issuing using most or all of the cores to process the files faster , it can be reading multiple files a time or using multiple cores to read a single file.
I would prefer using multiple cores to read a single file before moving it to the next.
I tried the code below but can't seem to get all the core used up.
The following code would basically retrieve *.txt file in the directory which contains htmls , in json format.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import json
import urlparse
import os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool # This is a thread-based Pool
from multiprocessing import cpu_count
def crawlTheHtml(htmlsource):
htmlArray = json.loads(htmlsource)
for eachHtml in htmlArray:
soup = BeautifulSoup(eachHtml['result'], 'html.parser')
if all(['another text to search' not in str(soup),
'text to search' not in str(soup)]):
try:
gd_no = ''
try:
gd_no = soup.find('input', {'id': 'GD_NO'})['value']
except:
pass
r = requests.post('domain api address', data={
'gd_no': gd_no,
})
except:
pass
if __name__ == '__main__':
pool = Pool(cpu_count() * 2)
print(cpu_count())
fileArray = []
for filename in os.listdir(os.getcwd()):
if filename.endswith('.txt'):
fileArray.append(filename)
for file in fileArray:
with open(file, 'r') as myfile:
htmlsource = myfile.read()
results = pool.map(crawlTheHtml(htmlsource), f)
On top of that , i'm not sure what the ,f represent.
Question 1 :
What did i not do properly to fully utilize all the cores/threads ?
Question 2 :
Is there a better way to use try : except : because sometimes the value is not in the page and that would cause the script to stop. When dealing with multiple variables, i will end up with a lot of try & except statement.