
I'm trying to build a DataFrame from web scraping. Specifically: from a GitHub search for a topic, the goal is to retrieve the repo owner's name, the link, and the "about" text.

I have many problems.

1. The search shows that there are, for example, more than 300,000 repos, but my scraping only retrieves 90 of them. I would like to scrape all the available repos.

2. Sometimes the about is empty. This stops me when creating the DataFrame (a minimal repro follows this list):

ValueError: All arrays must be of the same length

3. The names my search extracts come out completely strange.
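
(For context, pandas raises that error whenever the dict's columns have different lengths; a toy repro:)

import pandas as pd

# Columns of unequal length reproduce the error from problem 2:
pd.DataFrame({"name": ["a", "b", "c"], "about": ["x", "y"]})
# ValueError: All arrays must be of the same length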

My code:

import requests
from bs4 import BeautifulSoup

import pandas as pd
import re

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36'}

search_topics = "https://github.com/search?p="

stock_urls = []
stock_names = []
stock_about = []

for page in range(1, 99):

    req = requests.get(search_topics + str(page) + "&q=" + "nlp" + "&type=Repositories", headers = headers)
    soup = BeautifulSoup(req.text, "html.parser")

    #about
    for about in soup.select("p.mb-1"):
        stock_about.append(about.text)

    #urls
    for url in soup.findAll("a", attrs = {"class":"v-align-middle"}):
        link = url['href']
        complete_link = "https://github.com" + link
        stock_urls.append(complete_link)

    #profil name
    for url in soup.findAll("a", attrs = {"class":"v-align-middle"}):
        link = url['href']
        names = re.sub(r"\/(.*)\/(.*)", "\1", link)
        stock_names.append(names)

dico = {"name": stock_names, "url": stock_urls, "about": stock_about}       

#df = pd.DataFrame({"name": stock_names, "url": stock_urls, "about": stock_about})
df = pd.DataFrame.from_dict(dico)

My output:

ValueError: All arrays must be of the same length

– ladybug

  • GitHub has a search [API](https://docs.github.com/en/rest/search?apiVersion=2022-11-28). Use this instead of scraping. – baduker Dec 14 '22 at 10:49
  • As for the error you're getting, check this https://stackoverflow.com/questions/40442014/python-pandas-valueerror-arrays-must-be-all-same-length – baduker Dec 14 '22 at 10:50
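
For reference, a minimal sketch of the API route suggested in the first comment (endpoint and response fields per the linked REST docs; note that unauthenticated search requests are rate-limited to roughly 10 per minute, and, like the web UI, the search API returns at most the first 1000 results):

import time
import requests
import pandas as pd

rows = []
for page in range(1, 11):                        # 10 pages x 100 = the 1000-result cap
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": "nlp", "per_page": 100, "page": page},
        headers={"Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    for item in resp.json()["items"]:
        rows.append({
            "name": item["owner"]["login"],      # repo owner
            "url": item["html_url"],
            "about": item["description"],        # None when there is no about
        })
    time.sleep(6)                                # stay under the unauthenticated rate limit

df = pd.DataFrame(rows)

Because every field comes from the same JSON object, the alignment problem described in the question can't occur here.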

1 Answer

Lazy fix: zip the lists together so that the columns are truncated to the length of the shortest one (the DataFrame is built from a list of dictionaries via a list comprehension):

pd.DataFrame([{'name': n, 'url': u, 'about': a} for n, u, a 
               in zip(stock_names, stock_urls, stock_about)])

But that silently ignores a real problem: if the lists don't line up, how do you know that stock_names[i], stock_urls[i], and stock_about[i] all belong to the same repo? The lists fall out of sync because some repos have no "about" section, and since each list is built independently, there is no way to tell which entries have shifted.
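
To see the misalignment with made-up data: suppose the second repo has no about. zip then pairs the third repo's about with the second repo's name and drops the tail entirely:

names  = ["alice", "bob", "carol"]
abouts = ["NLP toolkit", "A tokenizer"]   # bob's repo had no about section

list(zip(names, abouts))
# [('alice', 'NLP toolkit'), ('bob', 'A tokenizer')]
# 'A tokenizer' is really carol's about, and carol is gone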

That's why it's better to merge the loops: iterate over the containers of the individual results and build dico as a list of dictionaries from the start, one result at a time:

dico = []

for page in range(1, 99):

    req = requests.get(search_topics + str(page) + "&q=" + "nlp" + "&type=Repositories", headers = headers)
    soup = BeautifulSoup(req.text, "html.parser")

    # for repo in soup.find('ul', class_="repo-list").find_all('li'):
    for repo in soup.select('ul.repo-list>li:has(a.v-align-middle[href])'):
        link = repo.select_one('a.v-align-middle[href]')
        about = repo.select_one('p.mb-1') 

        dico.append({
            # 'name': re.sub(r"\/(.*)\/(.*)", "\1", link.get('href')),
            'name': ' by '.join(link.text.strip().split('/', 1)[::-1]),
            'url': "https://github.com" + link.get('href'),
            'about': about.text.strip() if about else None
        })

df = pd.DataFrame(dico)

df looks something like:

[screenshot of the resulting DataFrame: columns name, url, about]
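
Incidentally, this explains problem 3 from the question: in re.sub(r"\/(.*)\/(.*)", "\1", link) the replacement is not a raw string, so "\1" is the control character \x01 rather than backreference 1, which is why the names looked so strange. A raw-string replacement fixes it, or you can skip the regex entirely (illustrated with a made-up href):

import re

link = "/huggingface/transformers"
re.sub(r"/(.*)/(.*)", r"\1", link)   # raw-string backreference -> 'huggingface'
link.split("/")[1]                   # simpler, no regex        -> 'huggingface'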

– Driftr95