
I have to manually populate a spreadsheet with details from many webpages. I only need a few details from each page, such as its title and description. Doing this by hand is becoming monotonous, so I thought I could semi-automate it using MATLAB.

Suppose this is the page as an example: http://www.smythstoys.com/uk/en-gb/video-games-tablets/c-751/xbox-one/p-14141/xbox-one-1tb-console/

I can read this page into MATLAB using:

page = urlread('..the_webpage..');

This basically copies the page's source code into a string variable. Viewing the source, I can see that the title is in its <title></title> tag, and so is the description.

Is there any way I can extract these values from the string and put them into cell arrays? From there I can easily move them to an Excel spreadsheet. I tried using textscan, but it did not work because I cannot tell what the delimiter between the values is.

Luis Mendo
StuckInPhDNoMore

1 Answer


You would need to write an HTML parser in MATLAB. Don't. There are a lot of projects that do that, because it's a very common task, but it's also a very, very complex one.

Try Python and Beautiful Soup: write a Python program that extracts the data for your MATLAB application. You can then execute the Python program from MATLAB.
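As a rough illustration of the kind of extraction script meant here, the following sketch pulls the title and description out of a page's source. To keep it runnable without installing anything, it uses the `html.parser` module from Python's standard library instead of Beautiful Soup. The names `PageDetails` and `extract_details` are made up for this example, the sample HTML is invented, and it assumes the description lives in a `<meta name="description">` tag, which may not hold for every page.

```python
from html.parser import HTMLParser


class PageDetails(HTMLParser):
    """Collects the <title> text and the <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            # Assumption: the description is stored in the "content" attribute.
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data


def extract_details(html):
    """Return (title, description); either is '' when not found."""
    parser = PageDetails()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.description


# Invented sample page, standing in for the string returned by urlread.
sample = """<html><head>
<title> Xbox One 1TB Console </title>
<meta name="description" content="A made-up description for illustration.">
</head><body><p>Hello</p></body></html>"""

title, description = extract_details(sample)
print(title)
print(description)
```

A script like this could write one row per page to a CSV file, which Excel opens directly; from MATLAB it could presumably be launched with the `system` function.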

MATLAB is a mathematical processing language. Writing an HTML parser in it would be like cutting down a tree with a herring. Don't waste your PhD candidate's life on that. Learn a minimal bit of Python and solve general-purpose problems with a general-purpose language.

Marcus Müller
  • On a side note: I looked into the Stack Overflow user that went insane, and it does mention that you can in fact use regex, on a limited basis, on something whose format you know. So in this case, grabbing simple info like the title or other stuff of that nature would actually be doable, even though it isn't the "best" idea. – ZaxLofful Sep 08 '15 at 22:55
  • Exactly! Don't use Matlab for this. Use Python, R, or Excel. – ASH Oct 11 '15 at 20:38
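
The limited regex approach mentioned in the first comment could be sketched roughly as follows. This is only an illustration for a single, well-understood field: the function name `title_from_source` and the sample string are invented, and the pattern assumes the page contains an ordinary `<title>...</title>` pair. It is not a general HTML parser and will break on unusual markup.

```python
import html
import re

# Matches the first <title ...>...</title> pair, ignoring case and
# allowing the title text to span several lines.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_from_source(page_source):
    """Return the page title, or None if no <title> element is found."""
    match = TITLE_RE.search(page_source)
    if match is None:
        return None
    # Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(html.unescape(match.group(1)).split())


# Invented sample, standing in for the downloaded page source.
sample = "<html><head><TITLE>\n  Xbox One &amp; Games\n</TITLE></head></html>"
print(title_from_source(sample))
```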