I'm trying to return multiple search results from Google based on two or more search terms. Example inputs:

digital economy gov.uk

digital economy gouv.fr

For about 50% of the search terms I input, the script below works fine. However, for the remaining search terms, I receive:

ValueError: arrays must all be same length

Any ideas on how I can address this?

output_df1 = pd.DataFrame()

for input in inputs:
    query = input
    #query = urllib.parse.quote_plus(query)
    number_result = 20
    ua = UserAgent()

    google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
    response = requests.get(google_url, {"User-Agent": ua.random})
    soup = BeautifulSoup(response.text, "html.parser")

    result_div = soup.find_all('div', attrs = {'class': 'ZINbbc'})

    links = []
    titles = []
    descriptions = []
    for r in result_div:
        # Checks if each element is present, else, raise exception
        try:
            link = r.find('a', href = True)
            title = r.find('div', attrs={'class':'vvjwJb'}).get_text()
            description = r.find('div', attrs={'class':'s3v9rd'}).get_text()

            # Check to make sure everything is present before appending
            if link != '' and title != '' and description != '':
                links.append(link['href'])
                titles.append(title)
                descriptions.append(description)
        # Next loop if one element is not present
        except:
            continue

    to_remove = []
    clean_links = []
    for i, l in enumerate(links):
        clean = re.search('\/url\?q\=(.*)\&sa',l)

        # Anything that doesn't fit the above pattern will be removed
        if clean is None:
            to_remove.append(i)
            continue
        clean_links.append(clean.group(1))

    output_dict = {
        'Search_Term': input,
        'Title': titles,
        'Description': descriptions,
        'URL': clean_links,
    }

    search_df = pd.DataFrame(output_dict, columns = output_dict.keys())

    # merging the data frames
    output_df1 = pd.concat([output_df1, search_df])

Based on the answer to "Python Pandas ValueError Arrays Must be All Same Length", I have also tried orient='index'. While this avoids the array error, it returns only one result for each search term:

  a = {
  'Search_Term': input,
  'Title': titles,
  'Description': descriptions,
  'URL': clean_links,
  }

  search_df = pd.DataFrame.from_dict(a, orient='index')
  search_df = search_df.transpose()


      #merging the data frames
  output_df1=pd.concat([output_df1,search_df])

Edit: based on @Hammurabi's answer, I was able to at least pull 20 rows per input, but they appear to be duplicates. Any idea how I can put each unique result in its own row?

  df = pd.DataFrame()
  cols = ['Search_Term', 'Title', 'Description', 'URL']

  for i in range(20):

      df_this_row = pd.DataFrame([[input, titles, descriptions, clean_links]], columns=cols)
      df = df.append(df_this_row)

      df = df.reset_index(drop=True)

  ##merging the data frames
  output_df1=pd.concat([output_df1,df])

Any thoughts on how I can address the array error so it works for all search terms, or how I can make the orient='index' method work for multiple search results? In my script I am trying to pull 20 results per search term.

Thanks for your help!

Simnicjon

1 Answer

You are running into columns of different lengths, maybe because you sometimes get more or fewer than 20 results per term. Dataframes can be put together even when they have different lengths. I think you want to append them rather than merge: each search term is different, so there are probably no matching search terms to consolidate. I don't think you want orient='index', because in the example you posted it puts whole lists into the df instead of separating out the list items. I also don't think you want input (which shadows the Python built-in of the same name) in the df; it looks like you want to repeat the query on each relevant row. Something may be going wrong when the dictionary is created.
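To make the failure concrete, here is a minimal sketch with made-up data (the titles, descriptions, and URLs are placeholders, not real search results, and try_build is a hypothetical helper). It assumes the mismatch comes from one list ending up shorter than the others, which can happen in the question's code whenever a link fails the regex and is left out of clean_links while its title and description are kept.

```python
import pandas as pd

titles = ['title0', 'title1', 'title2']
descriptions = ['des0', 'des1', 'des2']
clean_links = ['url0', 'url2']          # one link was dropped, so this list is shorter


def try_build(search_term, titles, descriptions, urls):
    """Return a DataFrame, or None if pandas rejects the column lengths."""
    try:
        return pd.DataFrame({
            'Search_Term': search_term,   # a scalar is repeated for every row
            'Title': titles,
            'Description': descriptions,
            'URL': urls,
        })
    except ValueError as err:
        print('ValueError:', err)
        return None


failed = try_build('search_term1', titles, descriptions, clean_links)
ok = try_build('search_term1', titles, descriptions, ['url0', 'url1', 'url2'])
print(ok)
```

The first call fails because the lists differ in length; the second succeeds once all three lists line up.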

You could consider appending 1 row at a time to your main dataframe, and skip the list and dictionary creation, after your line

if link != '' and title != '' and description != '': 

Maybe simplifying the df creation will avoid the error. See this toy example:

import pandas as pd

cols = ['Search_Term', 'Title', 'Description', 'URL']
frames = []   # one single-row dataframe per result

query = 'search_term1'
for i in range(2):
    link = 'url' + str(i)
    title = 'title' + str(i)
    description = 'des' + str(i)
    df_this_row = pd.DataFrame([[query, title, description, link]], columns=cols)
    frames.append(df_this_row)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end;
# ignore_index=True renumbers the rows (otherwise every row has index 0)
df = pd.concat(frames, ignore_index=True)
print(df)
#     Search_Term   Title Description   URL
# 0  search_term1  title0        des0  url0
# 1  search_term1  title1        des1  url1
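Applied to the scraping loop, the same row-at-a-time idea might look like the sketch below. The helper name rows_for_query and the sample raw_results tuples are hypothetical stand-ins for the parsing step; the regex is the one from the question. The point is that a result whose link does not match is skipped as a whole row, so title, description, and URL cannot drift out of step.

```python
import re
import pandas as pd

COLS = ['Search_Term', 'Title', 'Description', 'URL']
LINK_PATTERN = re.compile(r'/url\?q=(.*)&sa')   # same pattern as in the question


def rows_for_query(query, raw_results):
    """Build one row per usable result.

    raw_results holds (title, description, href) tuples (an assumed shape).
    """
    rows = []
    for title, description, href in raw_results:
        match = LINK_PATTERN.search(href)
        if not title or not description or match is None:
            continue                      # drop the whole result, not just its link
        rows.append([query, title, description, match.group(1)])
    return pd.DataFrame(rows, columns=COLS)


# Hypothetical stand-in for what the parsing step would produce
raw_results = [
    ('title0', 'des0', '/url?q=https://example.com/a&sa=U'),
    ('title1', 'des1', '/search?q=related'),          # no match: skipped
    ('title2', 'des2', '/url?q=https://example.com/b&sa=U'),
]
df = rows_for_query('search_term1', raw_results)
print(df)
```

The per-term frames returned this way can then be collected in a list and combined with a single pd.concat call.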

Update: you mentioned that you are getting the same result 20 times. I suspect that is because number_result is always 20, and you probably want to iterate over it instead.

Your code fixes number_result at 20, then uses it in the url:

number_result = 20
# ...
google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)

Try iterating instead:

for number_result in range(1, 21):  # if results start at 1 rather than 0
    # ...
    google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
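A separate possibility worth checking, based on the loop in the question's edit: each of the 20 rows there is built from the complete titles, descriptions, and clean_links lists, so every row holds the same lists. Here is a minimal sketch of the difference, with placeholder data and assuming the three lists are already the same length:

```python
import pandas as pd

cols = ['Search_Term', 'Title', 'Description', 'URL']
query = 'search_term1'
titles = ['title0', 'title1']
descriptions = ['des0', 'des1']
clean_links = ['url0', 'url1']

# Pattern from the question's edit: the whole lists go into every row
repeated = pd.DataFrame([[query, titles, descriptions, clean_links]] * 2, columns=cols)

# One row per result: zip pairs up the i-th title, description and link
per_result = pd.DataFrame(
    [[query, t, d, u] for t, d, u in zip(titles, descriptions, clean_links)],
    columns=cols,
)
print(per_result)
```

In repeated, each cell of the Title column is the full list; in per_result, each cell is a single title.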


Hammurabi
  • Thanks this is useful. I can now pull the first result from each search term into a dataframe. Struggling to iterate across all 20 results though; it just duplicates the first result 20 times. Any idea how to iterate it? – Simnicjon Jul 05 '21 at 00:27
  • I think I might understand what is happening, I'll edit the answer. – Hammurabi Jul 05 '21 at 00:38
  • Thanks, I see what you're trying to do, but still no joy in actually integrating it – Simnicjon Jul 06 '21 at 23:10