I've been reading HTML files into MATLAB with fileread, with the aim of using regexp to extract data from them. The function returns the file's contents as a string, which preserves the 'structure' of the HTML file, such as newlines. For example, reading a file with the contents below returns a string with the same structure.

<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
   <HEAD>
      <TITLE>
     A Small Hello
      </TITLE>
   </HEAD>
</HTML>

I'm looking for a function that will return a continuous string like ...

<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN"> <HTML> <HEAD> <TITLE> A Small Hello </TITLE> </HEAD> <BODY> <H1>Hi</H1> <P>This is very minimal "hello world" HTML document.</P> </BODY> </HTML>

This format will assist in my regexp endeavours.

Many thanks, Bob M

Amro
Bob M.
  • [Do not use regular expressions to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), use a proper HTML parser instead. – You Jul 11 '11 at 14:20

2 Answers

Regular expressions can do that:

str = fileread('file.html');
str = regexprep(str,'\s+',' ');   %# replace each run of whitespace (spaces, tabs, newlines) with a single space
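
As a minimal, self-contained sketch of the same idea, the snippet below applies the substitution to an in-memory string instead of a file. The sample text and the variable names `raw` and `flat` are made up for illustration; `strtrim` is added on the assumption that a leading or trailing space is unwanted.

```matlab
% Hypothetical sample input: a small HTML fragment with newlines and indentation.
raw = sprintf('<HTML>\n   <HEAD>\n      <TITLE>\n     A Small Hello\n      </TITLE>\n   </HEAD>\n</HTML>\n');

% Collapse every run of whitespace (spaces, tabs, newlines) into one space,
% then strip the space left over at either end.
flat = strtrim(regexprep(raw, '\s+', ' '));

disp(flat)
```

For this input the result should be a single line with exactly one space between tags and text.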
Amro

A quick way to jam these things together might be to import the data then concatenate them using strcat.

The code

imported_string = importdata(filename)
imported_string_together = strcat(imported_string{:})

produces the following output

imported_string = 

    '<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">'
    '<HTML>'
    '   <HEAD>'
    '      <TITLE>'
    '     A Small Hello'
    '      </TITLE>'
    '   </HEAD>'
    '</HTML>'


imported_string_together =

<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN"><HTML>   <HEAD>      <TITLE>     A Small Hello      </TITLE>   </HEAD></HTML>

but this isn't really efficient.
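
If the goal is the space-separated form from the question, one possible variation is to trim each line and join the pieces with a single space using `strjoin` (available in newer MATLAB releases) instead of `strcat`. This is only a sketch on a hard-coded cell array standing in for the `importdata` result; the names `htmlLines` and `joined` are illustrative.

```matlab
% Stand-in for the cell array of lines that importdata returned above.
htmlLines = {'<HTML>'; '   <HEAD>'; '      <TITLE>'; '     A Small Hello'; ...
             '      </TITLE>'; '   </HEAD>'; '</HTML>'};

% strtrim removes the indentation from each cell; the transpose gives
% strjoin a 1-by-n cell array, which it glues together with one space
% between consecutive pieces.
joined = strjoin(strtrim(htmlLines)', ' ');

disp(joined)
```
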

I find that it is sometimes useful to go back to fopen/fread/fscanf type functions to quickly load things in a predictable manner. For example, you can use the following code to create what you want without so much copying and other nonsense:

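filename = 'test.html';
maxReadSize = 2^10;   % read at most 1024 characters

fid = fopen(filename);
mystr = fscanf(fid, '%c', maxReadSize)   % '%c' reads one character at a time
fclose(fid);   % release the file handle when done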

to produce the following output:

<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN"><HTML>   <HEAD>      <TITLE>     A Small Hello      </TITLE>   </HEAD></HTML>
</HTML>
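
For files longer than `maxReadSize` characters, a possible alternative is to read the whole file with `fread` and collapse the whitespace afterwards with `regexprep`. The sketch below writes its own temporary file so that it can run standalone; the file contents and the use of `tempname` are assumptions made for illustration, not part of the approach above.

```matlab
% Write a small hypothetical test file so the sketch runs on its own.
filename = [tempname '.html'];
fid = fopen(filename, 'w');
fprintf(fid, '<HTML>\n   <HEAD>\n   </HEAD>\n</HTML>\n');
fclose(fid);

% Read the entire file as characters; '*char' keeps the data as char,
% and the transpose turns the column vector into a row string.
fid = fopen(filename, 'r');
mystr = fread(fid, '*char')';
fclose(fid);
delete(filename);   % clean up the temporary file

% The raw text still contains the newlines and indentation,
% so collapse each run of whitespace into a single space.
mystr = strtrim(regexprep(mystr, '\s+', ' '));
disp(mystr)
```
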
Steve
  • Thanks Steve, this works great. I found the `regexprep` a little better for my particular problem when there is problematic whitespace in the mix too. – Bob M. Jul 11 '11 at 14:27