I am scraping a county website that posts emergency calls and their locations. I have found success webscraping basic elements, but am having trouble scraping the rows of the table.

(Here is an example of what I am working with codewise)

location = list.find('div', class_='listing-search-item__sub-title')

I'm not sure how to scrape the rows of the table specifically. Can anyone explain how to dig into the sublevels of the HTML to find these records? I don't know whether I need to target tr, table, tbody, td, etc., and could use some guidance on which tag or class to select to reach the data.

zemken12
  • As a side note, for tables you can also use [pandas.read_html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html). It sometimes needs some tweaking and filtering to get the correct table, and you often get too many results, but it often saves a lot of the manual hassle of doing it with BS. – Daraan Oct 13 '22 at 21:26
  • Give us the link to the site, please – eightlay Oct 14 '22 at 00:09

1 Answer

For extracting specific nested elements, I often prefer .select, which takes CSS selectors (bs4 has no XPath support, but you can also check out these solutions using the lxml library), so for your case you could use something like

soup.select_one('table[id="form1:tableEx1"]').select('tbody tr')

but the results might look a bit weird since the columns might not be separated - to get separate columns/cells, you could collect the rows as tuples instead with

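# from bs4 import BeautifulSoup
# tHtml is the HTML of the page as a string
tableRows = [
    # one tuple per row, one stripped string per header/data cell
    tuple(c.text.strip() for c in r.find_all(['th', 'td']))
    for r in BeautifulSoup(tHtml, 'html.parser').select_one(
        'table[id="form1:tableEx1"]'
    ).select('tbody tr')
]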

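(Note that the plain .select('#id') format does not work when the id contains a ":" unless the colon is escaped with a backslash, which is why the attribute selector is used here.)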
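To make the selector behaviour concrete, here is a minimal sketch on a made-up table. Only the id form1:tableEx1 comes from the question; the sample HTML, its columns, and the variable names are assumptions for illustration. It shows the attribute selector, the '#id' form with the colon escaped, and the rows collected as tuples.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the county page; only the table id is from the question.
sample_html = """
<table id="form1:tableEx1">
  <thead><tr><th>Time</th><th>Type</th><th>Location</th></tr></thead>
  <tbody>
    <tr><td>21:05</td><td>Fire</td><td>100 Main St</td></tr>
    <tr><td>21:17</td><td>Medical</td><td>22 Oak Ave</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Attribute selector: the colon in the id needs no escaping here.
table = soup.select_one('table[id="form1:tableEx1"]')

# The '#id' form matches the same element once the colon is escaped
# (the raw string keeps the backslash intact).
same_table = soup.select_one(r"#form1\:tableEx1")

# One tuple per body row, one stripped string per cell.
table_rows = [
    tuple(cell.get_text(strip=True) for cell in row.find_all(["th", "td"]))
    for row in table.select("tbody tr")
]
print(table_rows)
```

Because the selector is 'tbody tr', the header row inside thead is skipped. Note that html.parser does not add a tbody that is missing from the source, so on a page without one the selector would need to be 'tr' instead.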

As one of the comments mentioned, you can use pandas.read_html(htmlString) to get a list of tables in the html; if you want a specific table, use the attrs argument:

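# import pandas
# from io import StringIO
# (recent pandas versions warn when given a raw HTML string, so wrap it)
pandas.read_html(StringIO(htmlString), attrs={'id': 'form1:tableEx1'})[0]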

but you will get the whole table, not just what's in tbody, and any tables nested inside it will be flattened (see results with table used from this example).
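As a rough illustration of the attrs filter, the sketch below runs read_html on a made-up page with two tables. The sample HTML and column names are assumptions, and it presumes pandas is installed together with an HTML parser backend such as lxml.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the id is taken from the question.
sample_html = """
<table><tr><td>unrelated</td></tr></table>
<table id="form1:tableEx1">
  <thead><tr><th>Time</th><th>Type</th><th>Location</th></tr></thead>
  <tbody>
    <tr><td>21:05</td><td>Fire</td><td>100 Main St</td></tr>
    <tr><td>21:17</td><td>Medical</td><td>22 Oak Ave</td></tr>
  </tbody>
</table>
"""

# Without attrs, every table on the page comes back as its own DataFrame.
all_tables = pd.read_html(StringIO(sample_html))

# With attrs, only tables whose attributes match are returned.
matching = pd.read_html(StringIO(sample_html), attrs={"id": "form1:tableEx1"})
df = matching[0]
print(len(all_tables), len(matching))
print(df)
```

The header row from thead becomes the column labels, which is one way the result differs from the tbody-only selection with select.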

And the single-statement method I showed at first with select cannot be used at all with nested tables since the output will be scrambled. Instead, if you want to preserve any nested inner tables without flattening, and if you are likely to be scraping tables often, I have the following set of functions which can be used in general:

  • first, define two other functions that the main table extractor depends on:
# get a list of tag names between a tag and one of its ancestors
def linkAncestor(t, a=None):
  # if a is t.parent: returns []
  # if a is None: returns the tag names of ALL ancestors
  # if a is not an ancestor of t: returns None
  aList = []
  # compare by identity so look-alike tags are not mistaken for the ancestor
  while a is None or t.parent is not a:
    t = t.parent
    if t is None:
      if a is not None: aList = None
      break
    aList.append(t.name)
  return aList

def getStrings_table(xSoup): 
  # not perfect, but enough for me so far
  tableTags = ['table', 'tr', 'th', 'td']
  return "\n".join([
      c.get_text(' ', strip=True) for c in xSoup.children 
      if c.get_text(' ', strip=True) and (c.name is None or (
          c.name not in tableTags and not c.find(tableTags)
      ))
  ])
  • then, you can define the function that extracts the tables as Python dictionaries:
def tablesFromSoup(mSoup, mode='a', simpleOp=False):
  typeDict = {'t': 'table', 'r': 'row', 'c': 'cell'}
  finderDict = {'t': 'table', 'r': 'tr', 'c': ['th', 'td']}
  refDict = {
    'a': {'tables': 't', 'loose_rows': 'r', 'loose_cells': 'c'},
    't': {'inner_tables': 't', 'rows': 'r', 'loose_cells': 'c'},
    'r': {'inner_tables': 't', 'inner_rows': 'r', 'cells': 'c'}, 
    'c': {'inner_tables': 't', 'inner_rows': 'r', 'inner_cells': 'c'}
  }
  mode = mode if mode in refDict else 'a'

  # for when simpleOp = True
  nextModes = {'a': 't', 't': 'r', 'r': 'c', 'c': 'a'}
  mainCont = {
      'a': 'tables', 't': 'rows', 'r': 'cells', 'c': 'inner_tables'
  }

  innerContent = {} 
  for k in refDict[mode]: 
    if simpleOp and k != mainCont[mode]: 
      continue
    
    fdKey = refDict[mode][k] # also the mode for recursive call
    innerSoups = [(
        s, linkAncestor(s, mSoup)
    ) for s in mSoup.find_all(finderDict[fdKey])] 
    # keep (tag, ancestor names) pairs, since the header checks below need both
    innerSoups = [(s, la) for s, la in innerSoups if not (
        'table' in la or 'tr' in la or 'td' in la or 'th' in la
    )]

    # recursive call
    kCont = [tablesFromSoup(s, fdKey, simpleOp) for s, la in innerSoups] 

    if simpleOp:
      if kCont == [] and mode == 'c': break
      return tuple(kCont) if mode == 'r' else kCont

    # if not empty, check if header then add to output
    if kCont: 
      if 'row' in k:
        for i in range(len(kCont)):
          if 'isHeader' in kCont[i]: continue
          kCont[i]['isHeader'] = 'thead' in innerSoups[i][1]
      if 'cell' in k:
        isH = [(c[0].name == 'th' or 'thead' in c[1]) for c in innerSoups]
        if sum(isH) > 0:
          if mode == 'r':
            innerContent['isHeader'] = True
          else: 
            innerContent[f'isHeader_{k}'] = isH
      
      innerContent[k] = kCont 
  
  if innerContent == {} and mode == 'c':
    innerContent = mSoup.get_text(' ', strip=True) 
  elif mode in typeDict:
    if innerContent == {}: 
      innerContent['innerText'] = mSoup.get_text(' ', strip=True)
    else:
      innerStrings = getStrings_table(mSoup)
      if innerStrings:
        innerContent['stringContent'] = innerStrings
    innerContent['type'] = typeDict[mode] 
  
  return innerContent

With the same example as before, this function gives this output. If the simpleOp argument is set to True, the output is simpler, but the headers are no longer differentiated and some other peripheral data is also excluded.
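For a quick check of why the single-statement select approach scrambles nested tables, and of the ancestor-filtering idea the functions above rely on, here is a minimal sketch. It is not the tablesFromSoup implementation: the sample HTML and the helper own_cells are hypothetical, and it uses bs4's find_parent instead of linkAncestor to keep the example short.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: an outer table whose second row holds an inner table.
sample_html = """
<table id="outer">
  <tbody>
    <tr><td>A1</td><td>A2</td></tr>
    <tr><td>B1</td><td>
      <table id="inner"><tbody>
        <tr><td>x</td><td>y</td></tr>
      </tbody></table>
    </td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
outer = soup.select_one('table[id="outer"]')

# Descendant selector: also returns the row that belongs to the inner table.
all_rows = outer.select("tbody tr")

# Keep only rows whose nearest <table> ancestor is the outer table itself.
own_rows = [r for r in all_rows if r.find_parent("table") is outer]


def own_cells(row):
    # Cells whose nearest <tr> ancestor is this row (skips inner-table cells).
    return [c for c in row.find_all(["th", "td"]) if c.find_parent("tr") is row]


cell_texts = [
    [c.get_text(" ", strip=True) for c in own_cells(r)] for r in own_rows
]
print(len(all_rows), len(own_rows))
print(cell_texts)
```

The inner table's text still ends up flattened inside its parent cell here. Keeping it as a separate structure is what the recursive function above is for.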

Driftr95