How to clean up the data from this webscraping script?

Question

So here is my code:

import requests
from bs4 import BeautifulSoup
import lxml

r = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(r.text, "lxml")

tables = soup.find_all('table')
print(tables)

I had to use a POST request because it's an ASP page and I needed to submit the right form data to get the correct results. I'm looking for all tables for the College of Business from a specific semester. The problem is the output:

<tr class="tableback2"><td>Overall assessment of instructor</td><td align="right">0.0%</td><td align="right">56.8%</td><td align="right">27.0%</td><td align="right">13.5%</td><td align="right">2.7%</td><td align="right">0.0%</td> </tr>
</table>, <table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr><tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA   4721  </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>

I expected BeautifulSoup to parse the text and return it neatly, with each column separated. I would like to put it into a dataframe afterwards, or perhaps save it to a CSV file, but I have no idea how to get rid of all of these HTML tags and attributes. I tried the following code, which removed the attributes I specified, but td and tr were not affected:

for tag in soup():
    for attribute in ["class", "id", "name", "style", "td", "tr"]:
        del tag[attribute]

Then I tried a package called bleach, but when I passed 'tables' to it, it reported that the input must be text, so apparently I can't put my table into it. This is ideally what I would like to see with my output.

So I'm truly at a loss as to how to format this properly. Any help is much appreciated.

I'm not familiar with Bleach, but you can just iterate over the table rows and cells: https://stackoverflow.com/questions/15250455/how-to-parse-html-table-with-python-and-beautifulsoup-and-write-to-csv — helb, Oct 02 '17 at 15:18
…and adapted for your code, with example output: https://gist.github.com/helb/be5263f7ce9d83e7dbe8c11277363814 — helb, Oct 02 '17 at 15:28
@helb your gist code worked, but the output isn't the same on my end as it is for you. Yours looks perfectly formatted, but mine looks markedly different. https://imgur.com/a/RtcHI < image. Not exactly sure why mine looks different. Did I mess up the code? — J Sowwy, Oct 02 '17 at 15:46
You could just iterate and replace with a regex expression via [`re.sub()`](https://docs.python.org/2/library/re.html#re.sub) — Mangohero1, Oct 02 '17 at 15:47
@JSowwy Your code looks okay. It might be either the terminal in your IDE messing up tabs, or an older Python version (`end` in `print()` is supported since 3.0, what version do you use?). Working example on repl.it: https://repl.it/Lwle/1 (loading data from a saved file instead of `requests`, but the rest is the same) — helb, Oct 02 '17 at 15:50
Printing without a newline in various Python versions: https://stackoverflow.com/questions/493386/how-to-print-without-newline-or-space#493399 — helb, Oct 02 '17 at 15:51
@mangoHero1 Yeah, but it can break things easily: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — helb, Oct 02 '17 at 15:52
Ah.. good point. I suppose you could just pretty print the values instead :-) — Mangohero1, Oct 02 '17 at 15:54
@helb I use Jetbrains pycharm. It's the one our university recommended. If there's a better or more easily usable IDE, I'm more than willing to try it. I do believe I have the most up to date Python version, however. So how do I fix this, and pretty print it like yours? :P — J Sowwy, Oct 02 '17 at 15:56
@JSowwy I don't know. Just tried it in PyCharm, and it seems to work: https://vgy.me/fpIIcx.png What's the first line in PyCharm's output (mine says it's using Python 3.6)? — helb, Oct 02 '17 at 16:14
@helb updating my PyCharm version and removing all instances and data from the previous fixed it! You are the man. But I have one last question. How can I neatly put this into a dataframe? It's a for loop nested in a for loop so I don't know how to store it as a dataframe... hmm... I tried using d = [] and then your for loop, and then pd.DataFrame(d). Would that work? — J Sowwy, Oct 03 '17 at 22:09

SIM · Answer 1 · 2017-10-02T17:36:29.497

Give this a try; I think this is what you expected. By the way, if there is more than one table on the page and you want a different one, change the index, as in soup.select('table')[n]. Thanks.

import requests
from bs4 import BeautifulSoup

res = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(res.text, "lxml")

# Pick the first table; change the index to select a different one
table = soup.select('table')[0]
# Build one list of cell texts per row, dropping non-breaking spaces
list_items = [[cell.text.replace("\xa0", "") for cell in row.select("td")]
              for row in table.select("tr")]

for data in list_items:
    print(' '.join(data))

Partial results:

Term: 1175 - Summer 2017
Instructor Name: Elias, Desiree   Department: SCHACCOUNT
Course: ACG   2021   Section: RVCC-1 Title: ACC Decisions
Enrolled: 118 Ref#: 51914 -1  Completed Forms: 36
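
Since the question also asks about saving the result to a CSV file, here is a minimal sketch of one way to go from table rows to CSV text. It runs on a small sample of the markup quoted in the question rather than on the live page, so it needs no network access. The helper names table_to_rows and rows_to_csv are made up for this illustration, and "html.parser" is used only so the sketch does not depend on lxml being installed.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question
SAMPLE_HTML = """
<table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr><tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA   4721  </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>
</table>
"""


def table_to_rows(table):
    """Return one list of cleaned cell strings per <tr> (hypothetical helper)."""
    rows = []
    for tr in table.find_all("tr"):
        # get_text() drops the tags; split/join collapses runs of whitespace,
        # including non-breaking spaces, into single spaces
        cells = [" ".join(td.get_text().split()) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table"))
csv_text = rows_to_csv(rows)
print(csv_text)
```

To write a file instead, the same csv.writer call can be pointed at a file opened with newline="". If pandas is installed, passing the same list of lists to pd.DataFrame(rows) should also work, with shorter rows padded by missing values; treat that as an untested assumption for this particular page.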
