grateful for your help.
I'm trying to return multiple search results from Google based on two or more search terms. Example inputs:
digital economy gov.uk
digital economy gouv.fr
For about 50% of the search terms I input, the script below works fine. For the remaining search terms, however, I receive:
ValueError: arrays must all be same length
Any ideas on how I can address this?
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

inputs = ["digital economy gov.uk", "digital economy gouv.fr"]

output_df1 = pd.DataFrame()
for input in inputs:
    query = input
    #query = urllib.parse.quote_plus(query)
    number_result = 20
    ua = UserAgent()
    google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
    response = requests.get(google_url, {"User-Agent": ua.random})
    soup = BeautifulSoup(response.text, "html.parser")
    result_div = soup.find_all('div', attrs={'class': 'ZINbbc'})
    links = []
    titles = []
    descriptions = []
    for r in result_div:
        # Skip this result if any element is missing
        try:
            link = r.find('a', href=True)
            title = r.find('div', attrs={'class': 'vvjwJb'}).get_text()
            description = r.find('div', attrs={'class': 's3v9rd'}).get_text()
            # Check to make sure everything is present before appending
            if link != '' and title != '' and description != '':
                links.append(link['href'])
                titles.append(title)
                descriptions.append(description)
        # Next loop if one element is not present
        except (AttributeError, TypeError):
            continue
    to_remove = []
    clean_links = []
    for i, l in enumerate(links):
        clean = re.search(r'\/url\?q\=(.*)\&sa', l)
        # Anything that doesn't fit the above pattern will be removed
        if clean is None:
            to_remove.append(i)
            continue
        clean_links.append(clean.group(1))
    output_dict = {
        'Search_Term': input,
        'Title': titles,
        'Description': descriptions,
        'URL': clean_links,
    }
    search_df = pd.DataFrame(output_dict, columns=output_dict.keys())
    # merging the data frames
    output_df1 = pd.concat([output_df1, search_df])
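My best guess at where the lengths diverge: the regex step drops entries from links only, while titles and descriptions keep every entry (to_remove is filled but never used). A minimal sketch of that effect, using made-up sample data in place of real scraped results:

```python
import re

# Made-up sample data standing in for what the scraping loop collects.
links = [
    "/url?q=https://www.gov.uk/a&sa=U&ved=1",
    "/search?q=related+query",  # does not match the pattern below
    "/url?q=https://www.gov.uk/b&sa=U&ved=2",
]
titles = ["Title A", "Title B", "Title C"]
descriptions = ["Desc A", "Desc B", "Desc C"]

pattern = re.compile(r"/url\?q=(.*)&sa")

# Same filtering as in the script: only the list of links shrinks.
clean_links = []
for link in links:
    match = pattern.search(link)
    if match is None:
        continue
    clean_links.append(match.group(1))

# titles and descriptions still have 3 entries, clean_links has 2.
print(len(titles), len(descriptions), len(clean_links))
```

If that is what happens, any search term whose results include a link that does not match the pattern would leave 'URL' shorter than 'Title' and 'Description', which would explain why only some terms fail.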
Based on the answer to "Python Pandas ValueError Arrays Must be All Same Length", I have also tried to use orient='index'. While this does not give me the array error, it only returns one result for each search term:
a = {
    'Search_Term': input,
    'Title': titles,
    'Description': descriptions,
    'URL': clean_links,
}
search_df = pd.DataFrame.from_dict(a, orient='index')
search_df = search_df.transpose()
# merging the data frames
output_df1 = pd.concat([output_df1, search_df])
Edit: based on @Hammurabi's answer, I was able to pull at least 20 rows per input, but these appear to be duplicates. Any idea how I can put each unique result on its own row?
df = pd.DataFrame()
cols = ['Search_Term', 'Title', 'Description', 'URL']
for i in range(20):
    df_this_row = pd.DataFrame([[input, titles, descriptions, clean_links]], columns=cols)
    df = df.append(df_this_row)
df = df.reset_index(drop=True)
# merging the data frames
output_df1 = pd.concat([output_df1, df])
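As far as I can tell, the duplicates come from the loop body never using i: every iteration builds the same row, with the whole titles, descriptions and clean_links lists packed into single cells. A minimal sketch of the difference, using made-up sample data and plain lists (the names duplicated and unique are only for illustration):

```python
search_term = "digital economy gov.uk"
titles = ["Title A", "Title B"]
descriptions = ["Desc A", "Desc B"]
clean_links = ["https://www.gov.uk/a", "https://www.gov.uk/b"]

# What the loop above builds: the same row each time,
# with every cell holding an entire list.
duplicated = [[search_term, titles, descriptions, clean_links] for _ in range(3)]

# One row per result instead: take matching elements from each list.
unique = [
    [search_term, title, description, url]
    for title, description, url in zip(titles, descriptions, clean_links)
]

for row in unique:
    print(row)
```

Note that zip stops at the shortest list, so this only pairs the right URL with the right title if the three lists are still aligned at that point.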
Any thoughts on how I can address the array error so that the script works for all search terms, or on how I can make the orient='index' method work for multiple search results? In my script I am trying to pull 20 results per search term.
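One direction I am considering, sketched under the assumption that links, titles and descriptions are the same length when they leave the scraping loop (they are appended together there): filter per result rather than per list, so that a result with a non-matching link is dropped as a whole. The helper build_rows and the sample data are hypothetical and not part of my script:

```python
import re

URL_PATTERN = re.compile(r"/url\?q=(.*)&sa")


def build_rows(search_term, links, titles, descriptions):
    """Return one dict per result, skipping results whose link does not match."""
    rows = []
    for link, title, description in zip(links, titles, descriptions):
        match = URL_PATTERN.search(link)
        if match is None:
            continue  # drop the whole result, not just its URL
        rows.append({
            "Search_Term": search_term,
            "Title": title,
            "Description": description,
            "URL": match.group(1),
        })
    return rows


# Made-up sample data; the second link does not match the pattern.
rows = build_rows(
    "digital economy gov.uk",
    ["/url?q=https://www.gov.uk/a&sa=U", "/search?q=x", "/url?q=https://www.gov.uk/b&sa=U"],
    ["Title A", "Title B", "Title C"],
    ["Desc A", "Desc B", "Desc C"],
)
print(rows)
```

Since pd.DataFrame accepts a list of dicts, pd.DataFrame(rows) should then give one complete row per result, and the columns can no longer end up with different lengths.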
Thanks for your help!