A web page contains some data arranged in columns and delimited by the tags "pre" and "/pre":

ColumnA   ColumnB  ColumnC   ColumnD   ColumnE

01/2050   1009.0     11         9    
01/1950   1009.0                8    
01/1850   1009.0     11         8         82
01/1750   1009.0     10         87
01/1650   1008.0     10         7         82
01/1550   1008.0     11         8         82

I get them with the following code

s = regexp(urlpage, '<PRE[^>]*>(.*?)</PRE>', 'tokens');
s = [s{:}]';

%token to rows (cell)
row = textscan(s{1}', '%s', 'delimiter', '\n'); 

But in this situation I don't know the value of every element, and I would like to read each of them. I tried with

splitstring = textscan(row{1}{r},'%s');

and with

splitstring = textscan(row{1}{r},'%s  %f %d %d %d');

but the whitespace (the empty columns) isn't detected! For example, in the second row I get a {3x1 cell}, not a {5x1 cell}.
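The behaviour can be reproduced without the web page. A minimal sketch, assuming the second data row above is passed to `textscan` as a plain char array (the variable names `rowText`, `tokens` and `numTokens` are only illustrative):

```matlab
% Second data row from the table above, hard-coded for the sketch
rowText = '01/1950   1009.0                8    ';

% '%s' splits on runs of whitespace, so blank columns leave no trace
tokens = textscan(rowText, '%s');

% Number of fields actually returned for this row
numTokens = numel(tokens{1});
```

Because consecutive blanks are collapsed into a single separator, the two missing columns simply disappear from the result.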

Mixo
  • try `regexp` with the period metachar `.` to find those white spaces. [Link](http://www.mathworks.co.uk/help/matlab/matlab_prog/regular-expressions.html) – The-Duck Mar 18 '14 at 22:00
  • @Kirby in fid there are some "tokens" created by `t = regexp(html, '<PRE[^>]*>(.*?)</PRE>', 'tokens');` – Mixo Mar 18 '14 at 22:19
  • @The-Duck example? I do not understand how to look for white spaces and then to be able to exclude the row – Mixo Mar 18 '14 at 22:34
  • @Mixo, First get all rows, then exclude the ones you don't want. Because there are some whitespaces that are not a problem, try grouping the columns using regexp to detect the usual form with the correct whitespace count. I will make an example tomorrow if I can. Sorry I can't help more at this point in time. – The-Duck Mar 18 '14 at 23:01
  • ...alternatively, replace them with NaN without "deleting" rows. Any tips? – Mixo Mar 18 '14 at 23:11
  • @The-Duck thanks, I look forward to your solution – Mixo Mar 19 '14 at 14:10

1 Answer

Sorry it took a bit, but here it is:

As I said, I first load the data using textscan by line (\n delimiter). I can then evaluate each line separately and see if it matches the regular expression specified by:

'\w*/\w*......\w*\.\w.....\d\d..\d...\d\d'

The meta characters are as described here

Next I just loop through the lines to keep only the ones that matched (the others were not matched by the regular expression).

There is a way to vectorize this, but this simple loop should do the trick for now. Also note that this is a very layout-specific way to detect a pattern: any change in the character spacing between the data columns will require a new metacharacter string to match.

The final matching rows are contained in the cell array y.

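clear
clc

% Read the file as one string per line
ftoread = 'text.txt';
fid = fopen(ftoread);
data = textscan(fid,'%s','Delimiter','\n','EmptyValue',NaN);
fclose(fid);

x = data{1};
y = {};   % rows that match the pattern
c = 1;
for ind = 1:size(x,1)
    % 'once' returns the matched text itself (empty if the row does not match)
    m = regexp(x{ind},'\w*/\w*......\w*\.\w.....\d\d..\d...\d\d','match','once');
    if ~isempty(m)
        y{c} = m;
        c = c+1;
    end
end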

text.txt

01/2050      1009.0     11  9   87
01/1950      1009.0         8   93
01/1850      1009.0     11  8   82
01/1750      1009.0     10      87
01/1650      1008.0     10  7   82
01/1550      1008.0     11  8   82
01/1450      1008.0                 82
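
The vectorization mentioned above could look roughly like this. It is only a sketch under assumptions: the rows are hard-coded (copied from text.txt) instead of being read with `textscan`, and it relies on `regexp` accepting a cell array of strings and on the `'once'` option returning an empty string for rows that do not match.

```matlab
% Rows copied from text.txt above (hard-coded so the sketch is self-contained)
x = {
    '01/2050      1009.0     11  9   87'
    '01/1950      1009.0         8   93'
    '01/1850      1009.0     11  8   82'
    '01/1750      1009.0     10      87'
    '01/1650      1008.0     10  7   82'
    '01/1550      1008.0     11  8   82'
    '01/1450      1008.0                 82'
    };

pat = '\w*/\w*......\w*\.\w.....\d\d..\d...\d\d';

% regexp runs on every cell at once; with 'once' each result is a char
% array, which is empty where the row did not match
m = regexp(x, pat, 'match', 'once');

% Keep only the rows that matched
y = m(~cellfun('isempty', m));
```

With the sample file this keeps the four complete rows (01/2050, 01/1850, 01/1650 and 01/1550).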

Hope this is still relevant

The-Duck
  • Thank you for your interest! Your suggestion is clear, but I don't match any data! There is no limit to the number of digits (values can also be negative) in the last three columns. I tried with `m = regexp(x{ind},'\w*/\w*\s*\d*\.\d*\s*\d*\s*\d*\s*\d','match')` but it doesn't match – Mixo Mar 21 '14 at 09:30
  • @Mixo, I believe that when specifying `\s` you are specifying a single white-space character, and adding the asterisk may fail (not sure about that, though). As you require the full line to be complete with all values, you will need to match the entire line rather than individual sequences. Use word detection where a zero before a value is important, and maybe even for negative values. You may also wish to try detecting each column separately, i.e. run `regexp` 5 times per line, once per column. – The-Duck Mar 21 '14 at 11:29
  • Each column is a unit of measure, but can be "empty" (not present) – Mixo Mar 22 '14 at 15:45
  • Summary of my problem: each column is a unit of measure, but can be "empty" (not present). All these data are read from a web page, where they are formatted with `<pre>` tags! The columns don't have a separator; they are separated only by blank space (one or more!) – Mixo Mar 22 '14 at 15:57
  • Then have a look at [this](http://stackoverflow.com/questions/22213057/matlab-text-string-html-parse/22226034#22226034). I believe that with the background you have built up so far you should be OK. – The-Duck Mar 22 '14 at 17:12
  • I already get the "data token" with urlread (all the data are delimited by the `<pre>` tag and are in s{1}); the next difficult step is to scan the elements of each row s{1}{row} – Mixo Mar 22 '14 at 17:34
  • I suggest that you either edit your question or post a new one as this is becoming hard to understand as I can only assume your problem or conditions changed. – The-Duck Mar 22 '14 at 18:06
  • I edited my previous question, now the scenario is complete. Thank you The Duck! – Mixo Mar 23 '14 at 00:31
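
Following up on the comments (empty columns, variable digit counts, negative values, NaN instead of dropped rows): text inside "pre" tags keeps its alignment, so one option is to cut each row at fixed character positions instead of matching a pattern. A minimal sketch, assuming the layout of text.txt from the answer. The `starts`/`stops` boundaries are hypothetical values read off that sample and would have to be adjusted for the real page, and the rows are hard-coded rather than taken from `s{1}`.

```matlab
% Sample rows copied from text.txt in the answer (hard-coded for the sketch)
rows = {
    '01/2050      1009.0     11  9   87'
    '01/1950      1009.0         8   93'
    '01/1750      1009.0     10      87'
    '01/1450      1008.0                 82'
    };

% ASSUMPTION: first and last character of each numeric column, chosen wide
% enough to absorb extra digits or a minus sign; Inf means "to end of line"
starts = [ 8 20 27 30];
stops  = [19 26 29 Inf];

dates = cell(numel(rows), 1);
vals  = nan(numel(rows), numel(starts));   % blank fields stay NaN

for r = 1:numel(rows)
    txt = rows{r};
    % The date occupies the first 7 characters in this sample
    dates{r} = strtrim(txt(1:min(7, numel(txt))));
    for k = 1:numel(starts)
        last  = min(stops(k), numel(txt));
        field = strtrim(txt(starts(k):last));   % empty if the column is blank
        vals(r, k) = str2double(field);         % str2double('') gives NaN
    end
end
```

Each row of `vals` then always has four entries, with NaN where the page showed only blanks, so no row needs to be discarded.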