Getting Table Attributes from a Website

Question

I am using Python 3.4, Windows 10, and Visual Studio 2015. I am trying to make a program that scrapes phone numbers from websites formatted like this one. I am using Beautiful Soup 4 and am trying to get the number of beds from the table. I have tried `soup.select('.td')`, but it only returns an empty list, and I am not sure what else to try.

`.td` looks for a class named td, not the `td` tag; see http://stackoverflow.com/questions/13074586/extracting-selected-columns-from-a-table-using-beautifulsoup to get a certain column — depperm, Dec 28 '16 at 20:02
"Sorry, AHD.com's Free Hospital Information service is not available to your region." You should post the HTML code. — 宏杰李, Dec 29 '16 at 02:01
I'll try to do that later, I can't reach it right now. I also tried it without the period, didn't work. Will try your suggestion, thanks. — Sig, Dec 29 '16 at 02:46
I get the error `soup.findAll('table')[0] does not have the attribute tbody` — Sig, Dec 29 '16 at 18:19
I should also mention that I switched to BeautifulSoup from bs4 and to Python 2.7 to try to match that code — Sig, Dec 29 '16 at 18:22
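
For reference, here is a minimal sketch of the difference the first comment describes, using Beautiful Soup 4 on Python 3. The HTML is a made-up stand-in, since the real page's markup was never posted, and the assumption that the bed count sits in the second column is purely illustrative. It also shows one likely reason for the `tbody` error: browsers insert `<tbody>` when rendering, but the raw HTML often has none, so it is safer to iterate over `tr` elements directly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's structure is unknown.
html = """
<table>
  <tr><th>Name</th><th>Beds</th></tr>
  <tr><td>General Hospital</td><td>120</td></tr>
  <tr><td>County Clinic</td><td>45</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# ".td" is a class selector: it matches elements with class="td".
no_match = soup.select(".td")
# "td" is a type selector: it matches <td> elements.
cells = soup.select("td")

# Walk the rows directly instead of going through table.tbody,
# which is None when the source HTML has no <tbody> element.
table = soup.find("table")
beds = []
for row in table.find_all("tr"):
    tds = row.find_all("td")
    if len(tds) >= 2:  # skips the header row, which only has <th> cells
        beds.append(tds[1].get_text(strip=True))

print(no_match, len(cells), beds)
```

If the real table has a different column order, only the index in `tds[1]` needs to change.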

score 0 · Answer 1 · edited May 23 '17 at 10:30

Why not grab the entire page HTML as a string and then use a regular expression to parse it? Is that not where Python excels?

In case you are afraid of regex, here is a beginner-friendly tutorial: https://regexone.com/

The syntax for Python might be slightly different: https://docs.python.org/2/library/re.html

And I seriously hope you are not scraping phone numbers for nefarious purposes. I don't want a phone call from you :-).

Here is another Stack Overflow answer which gives a good starting regex: https://stackoverflow.com/a/123666/5129424

Here's a regex for a 7- or 10-digit number, with extensions allowed. Delimiters are spaces, dashes, or periods:

^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$

Just because you "might mess it up" doesn't mean you shouldn't try it and test it. Regardless of what you do, you are either at the mercy of the structure of the page, which may change, or the format of the phone numbers, which may also change. There is no perfect solution.
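
As a rough sketch of how the regex quoted above could be tried out in Python 3: the pattern is copied verbatim, while `PHONE_RE` and `parse_phone` are names made up for this example. Because the pattern is anchored with `^` and `$`, it validates one candidate string at a time (for example, the text of a single table cell) rather than finding numbers inside a whole page.

```python
import re

# Pattern copied verbatim from the answer; it is anchored at both ends.
PHONE_RE = re.compile(
    r"^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$"
)


def parse_phone(text):
    """Return the parts of a phone number, or None if text does not match.

    Hypothetical helper for illustration only.
    """
    m = PHONE_RE.match(text.strip())
    if m is None:
        return None
    # Groups 1 and 2 are the area code with and without parentheses.
    paren_area, bare_area, exchange, line, ext = m.groups()
    return {
        "area": paren_area or bare_area,
        "exchange": exchange,
        "line": line,
        "extension": ext,
    }


for candidate in ["234-567-8901", "(212) 555-0199 ext. 42", "not a number"]:
    print(candidate, "->", parse_phone(candidate))
```

Testing it against a handful of known-good and known-bad strings like this is one way to address the "untested regex" concern raised in the comments.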

Using a regex to parse HTML/XML is not exactly pleasant, and it may even be dangerous if you get some part of the regex wrong (and in this case you will) — Governa, Dec 28 '16 at 20:08
The book I'm learning all this from said regex isn't helpful, I might try it if I have to though. — Sig, Dec 29 '16 at 02:44
@Governa It's only dangerous if it is untested. And getting a regex "wrong" may be due to any number of things - format changes, malformed expressions, etc. — TinkerTenorSoftwareGuy, Dec 29 '16 at 20:54
