4

This website lists over 250 courses in one list. I want to get the name of each course and insert it into my MySQL database using PHP. The courses are listed like this:

<td> computer science</td>
<td> media studies</td>
…

Is there a way to do that in PHP, instead of me having a mad data entry nightmare?

Gilles 'SO- stop being evil'
getaway
  • Unless you need to refresh the database from the list very often, I'd suggest that you simply save the page as an HTML file, and then write a simple jQuery script that takes the text from each TD and stitches together an SQL string that you print out in a textarea or to the Firebug console or something. – Splashdust Oct 15 '10 at 22:16
  • I really only need the course names for an autosuggest feature, so yeah, I'd actually consider that, but I'm not very good at jQuery, I'm so dumb lol :)) – getaway Oct 15 '10 at 22:18
  • *(related)* [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Oct 15 '10 at 22:40

5 Answers

4

Regular expressions work well.

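// get the page (placeholder URL, as suggested in the comments below)
$page = file_get_contents("http://your-url.com/page.html");
// split the page into lines; \r? also handles Windows line endings
$lines = preg_split("/\r?\n/", $page);
foreach ($lines as $text) {
    $matches = array();
    // only lines that consist of a single <td>...</td> cell will match
    if (preg_match("/^\s*<td>(.*)<\/td>\s*$/", $text, $matches)) {
        $name = trim($matches[1]);
        // insert $name into the database
    }
}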

See the documentation for preg_match.
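As a variation on the loop above, here is a minimal sketch that scans the whole page in one pass with `preg_match_all`, so it does not depend on each cell sitting on its own line. The function name `extractCourseNamesWithRegex` and the sample markup are assumptions made for illustration, not part of the original answer, and the pattern only copes with simple, non-nested cells.

```php
<?php
// Hypothetical helper (not from the original answer): collect the text of
// every simple <td>...</td> cell in a page using a single regex pass.
function extractCourseNamesWithRegex(string $html): array
{
    // Non-greedy match so several cells on one line are captured separately;
    // the "s" flag lets a cell span line breaks, "i" tolerates <TD>.
    preg_match_all('/<td\b[^>]*>(.*?)<\/td>/is', $html, $matches);

    $names = array();
    foreach ($matches[1] as $cell) {
        // Drop nested tags (e.g. links), decode entities, trim whitespace.
        $name = trim(html_entity_decode(strip_tags($cell), ENT_QUOTES, 'UTF-8'));
        if ($name !== '') {
            $names[] = $name;
        }
    }
    return $names;
}

// Sample markup modelled on the question; a real page would come from
// file_get_contents() instead.
$sample = "<td> computer science</td>\n<td> media studies</td>";
print_r(extractCourseNamesWithRegex($sample));
```

Returning a plain array keeps the extraction separate from the database step, so the list can be inspected before anything is inserted.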

Peter C
  • Oh, I love this, this is exactly what I need, but can you elaborate on how I'm going to get the page? In terms of inserting, do you just insert $matches[1] into the database, or does it have to change to $matches[2], etc.? – getaway Oct 15 '10 at 22:25
  • Just insert $matches[1] into the database. It will be updated every iteration of the loop. An easy way to get the page is `file_get_contents("http://your-url.com/page.html")`. – Peter C Oct 15 '10 at 22:42
  • [obligatory link telling you Regex isn't for parsing HTML](http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html) – Gordon Oct 15 '10 at 22:47
  • Yeah, I know, but for a quick-and-dirty job like this that he's only gonna use once and he already knows the structure of the HTML, regexes are really convenient. Besides, if he wants maintainable, bug-free code he should stay away from PHP.... – Peter C Oct 15 '10 at 22:56
  • They are not any more convenient than using a proper parser. And please keep the language bias away. No language is bug-free and there is no reason why you would not be able to create a maintainable application with PHP (unless you are a bad developer). – Gordon Oct 15 '10 at 23:06
  • I got 6 lines (took me less than 2 minutes), not including insertion into the database. And there is a difference between a bug-free language and a language that makes it easy to write buggy code. – Peter C Oct 15 '10 at 23:15
  • It takes 5 lines of code with DOM (excluding insertion). It takes less than 1 minute to write. And it's much more reliable than your Regex. And I still don't see why PHP should make it any easier than any other scripting language to write buggy code. – Gordon Oct 16 '10 at 08:55
  • Nice. If you know how to use the library, of course.... Doesn't downloading, installing, and learning how to use a full HTML-parsing library seem a little overkill to you? And PHP never warns you when it should, is way too loosely typed, and supports combining presentation and logic by embedding PHP directly in HTML. – Peter C Oct 16 '10 at 15:47
  • No, I don't think that is overkill. DOM is a native extension of PHP and enabled by default, so there is nothing to download or install. DOM is an implementation of the language-agnostic W3C DOM interface, so chances are the OP already knows it from another language. With five lines of code there isn't much to learn, and Regex patterns and functions have to be learned too, so you hardly have an argument. I won't argue about your other claims since they are nonsense. Maybe you should heed your own advice to the OP instead and not use PHP (or give ill advice about it, for that matter). – Gordon Oct 16 '10 at 22:50
  • Sorry. I am not a PHP programmer and didn't know DOM stuff was built in. I saw other people suggesting separate libraries and assumed it wasn't. In the languages that I come from regexes would be the most convenient (though certainly not the best) way to do this. Also, I couldn't help but notice that your "5-line 1-minute" answer is missing. Consider posting it if you want to show him the "real way to do it". And yes, I don't use PHP (for the reasons I stated, plus the community seems obnoxious) and wouldn't recommend it to anybody but he specifically said "is there a way to do this in PHP". – Peter C Oct 17 '10 at 02:22
  • I don't post it because the [question is a duplicate](http://stackoverflow.com/search?q=dom+regex+php). The problem is [finding the question to closevote is tedious](http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php). One OP wants to parse td elements while the other wants img elements. The approach is always the same, yet there is at least one question like this daily. [I've answered so many](http://stackoverflow.com/search?q=user%3A208809+dom+regex) of them by now, that I got the Regex Badge for providing DOM solutions. It's just tiring by now. – Gordon Oct 17 '10 at 10:51
  • I don't agree with your comments about the PHP community. If you go to a PHP conference (that's where the community is) you will notice the people are quite cheerful, helpful and open-minded. Of course, when you approach them with bias and silly arguments about how PHP doesn't do this or that, they likely won't react like that anymore. PHP is what it is (and it is very successful the way it is). Yes, it is not perfect but neither are other languages. – Gordon Oct 17 '10 at 11:09
  • I'm not even going to bother responding to something as subjective as this. Suffice it to say, if you want something done right, do it yourself. So if you want him to do things your way, show him your way. – Peter C Oct 17 '10 at 18:11
  • Yes, actually you're right on this one. I've spent so much time on this question already that I might as well add the DOM solution (actually I already did). Of course I am being subjective on the community (they gave me free beer) but I don't think [PHP's success](http://blogs.gartner.com/mark_driver/2009/12/03/php-past-present-and-future/) can reasonably be denied. You are free to think differently and you are also free to nurture your bias against PHP. But to repeat myself, I'd just appreciate it if you'd keep it to yourself (at least within StackOverflow's PHP tag - it's not helpful). – Gordon Oct 17 '10 at 22:03
  • Okay, the DOM solution is definitely nicer. PHP is (very) successful because it is easy to learn and use (and that's also the reason I think it is bad; it combines the presentation and logic of a page). And I'm allowed to express my opinions about PHP, thank you. All I said was if he wanted more maintainable code, he should stay away from PHP (it's easier for learning developers to abuse PHP than, say, Python or Scheme). – Peter C Oct 17 '10 at 22:20
  • Whether you combine presentation and logic of a page or not is up to you. If you are just going to whip out a small homepage it's perfectly fine to do so. Keep it simple. If you are going to write an enterprise webapp, you probably will use MVC and that's very much possible (and encouraged and established) too. As for PHP being easy to learn, yes that's true, but don't tell me you haven't seen sloppy "professional" code in languages that are hard to learn. Bad code exists in every language. – Gordon Oct 17 '10 at 22:48
  • Yeah, I admit, I've prototyped things in PHP before. I never would be foolish enough to use it in production though. It doesn't even auto-escape HTML (at least if it does, it isn't on by default). I've seen sloppy Java code, believe me. And it's not that the language is too easy to learn, it's not strict enough, IMO. – Peter C Oct 18 '10 at 03:20
  • I don't think it's fair to blame the entire language just because it doesn't do a particular thing. You could still write your own output function that wraps `echo` and `htmlentities` for escaping, so it's not a big thing. I'm not sure in what regard you consider PHP not strict enough, but I am also quite sure that you wouldn't convince me of it anyway. For me, PHP is fine. I make a living from it. I agree PHP is not an overly pretty language. But ultimately any language is just a tool. And PHP does very well as a tool. – Gordon Oct 18 '10 at 07:41
  • I'm not blaming the whole language because of it. I'm saying you have to be very careful using it in production environments. PHP is not strict enough means it is too easy to make messy bad code. I've seen bad Python, but not as bad as some PHP I've seen, because Python is more structured. And I've never seen bad Scheme, but that's because one doesn't find much Lisp code out there. I don't make a living off any language (because I'm 14) but for me, PHP isn't good enough. – Peter C Oct 18 '10 at 17:36
  • Anyway, you're entitled to your opinion about PHP as much as I am. So let's just agree to disagree, 'cause I'm getting tired of this. – Peter C Oct 18 '10 at 17:38
4

How to parse HTML has been asked and answered countless times before. While regular expressions will work for your specific use case, it is generally better and more reliable to use a proper parser for this task. Below is how to do it with DOM:

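// Collect parser warnings internally instead of printing them for sloppy markup
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://courses.westminster.ac.uk/CourseList.aspx');
// Print the text of every <td> element, one per line
foreach ($dom->getElementsByTagName('td') as $title) {
    echo trim($title->nodeValue), PHP_EOL;
}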

For inserting the data into MySQL, you should use the mysqli extension. Examples are plentiful on Stack Overflow, so please use the search function.
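To tie the two steps together, here is a rough sketch of extracting the names with DOM and inserting them through a mysqli prepared statement. Everything named here is an assumption for illustration: the helper functions `extractCourseNames` and `insertCourseNames`, the table `courses` with a single `name` column, and the placeholder credentials in the commented-out usage lines.

```php
<?php
// Hypothetical helper: return the trimmed, de-duplicated text of every <td>.
function extractCourseNames(string $html): array
{
    // Collect parser warnings internally rather than printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = array();
    foreach ($dom->getElementsByTagName('td') as $cell) {
        $name = trim($cell->nodeValue);
        if ($name !== '') {
            $names[] = $name;
        }
    }
    // array_unique keeps the first occurrence; array_values reindexes from 0.
    return array_values(array_unique($names));
}

// Hypothetical schema: a table `courses` with a single `name` column.
function insertCourseNames(mysqli $db, array $names): void
{
    // A prepared statement keeps the course names out of the SQL text itself.
    $stmt = $db->prepare('INSERT INTO courses (name) VALUES (?)');
    foreach ($names as $name) {
        $stmt->bind_param('s', $name);
        $stmt->execute();
    }
    $stmt->close();
}

// Usage sketch (credentials and database name are placeholders):
// mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);
// $db = new mysqli('localhost', 'user', 'password', 'mydb');
// $html = file_get_contents('http://courses.westminster.ac.uk/CourseList.aspx');
// insertCourseNames($db, extractCourseNames($html));
```

Enabling `mysqli_report` as in the usage lines makes a failed `prepare` or `execute` throw an exception instead of failing silently, which is easier to notice in a one-off import script.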

Community
Gordon
2

You can use this HTML parsing PHP library to achieve this: http://simplehtmldom.sourceforge.net/

greg0ire
  • Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Oct 15 '10 at 22:40
0

I encountered the same problem. A good class library for this is Simple HTML DOM: http://simplehtmldom.sourceforge.net/. It works much like jQuery.

Sam
0

Just for fun, here's a quick shell script to do the same thing.

curl http://courses.westminster.ac.uk/CourseList.aspx \
| sed '/<td>\(.*\)<\/td>/ { s/.*">\(.*\)<\/a>.*/\1/; b }; d;' \
| uniq > courses.txt
Dagg Nabbit