how to read only URL from txt file in MATLAB

Question

I have a text file containing multiple URLs along with other information about each URL. How can I read the txt file and save only the URLs in an array so that I can download them? I want to use

C = textscan(fileId, formatspec);

What should I put in formatspec to match a URL?

I am not java savvy but I guess you could do it using java in Matlab, you can start by reading [*How to detect the presence of URL in a string*](http://stackoverflow.com/questions/285619/how-to-detect-the-presence-of-url-in-a-string) and [*Calling Java from MATLAB*](http://blogs.mathworks.com/community/2009/07/06/calling-java-from-matlab/). — p8me, Jul 01 '13 at 03:45

score 4 · Accepted Answer · edited May 23 '17 at 10:32

This is not a job for textscan; you should use regular expressions for this. In MATLAB, regexes are handled by regexp and regexpi, which are described in the MATLAB documentation. For URL patterns, examples written for other languages are also worth consulting.

Here's an example in MATLAB:

% This string is obtained through textscan or something
str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};


% find URLs    
C = regexpi(str, ...
    ['((http|https|ftp|file)://|www\.|ftp\.)',...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'], 'match');

C{:}

Result:

ans = 
    'http://www.example.com/index.php?query=test&otherStuf=info'
ans = 
    'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'

Note that this regex requires the protocol to be included, or else a leading www. or ftp. prefix. Something like example.com/universal_remote.cgi?redirect= is NOT matched.
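To make that limitation concrete, here is a small sketch that runs the same pattern against a bare domain and against the same address with a leading www. prefix. The variable names (pattern, bare, prefixed) are illustrative only.

```matlab
% Same pattern as in the example above
pattern = ['((http|https|ftp|file)://|www\.|ftp\.)', ...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'];

% No protocol and no www./ftp. prefix: nothing is matched
bare = 'see example.com/universal_remote.cgi?redirect= for details';
mBare = regexpi(bare, pattern, 'match');

% Same address with a leading www.: now it is picked up
prefixed = 'see www.example.com/universal_remote.cgi?redirect= for details';
mPrefixed = regexpi(prefixed, pattern, 'match');

disp(isempty(mBare));   % true when the bare domain was not matched
disp(mPrefixed);
```

If bare domains matter for your data, the pattern would have to be extended, at the cost of more false positives on ordinary words containing dots.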

You could go on and make the regex cover more and more cases. However, eventually you'll stumble upon the most important conclusion (made, for example, in the source I got my regex from): given the full definition of what precisely constitutes a valid URL, there is no single regex able to always match every valid URL. That is, there are valid URLs you can dream up that are not captured by any of the regexes shown.

But please keep in mind that this last statement is more theoretical than practical: those non-matchable URLs are valid but not often encountered in practice :) In other words, if your URLs have a pretty standard form, you're pretty much covered with the regex I gave you.
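Tying this back to the original question, the following is a minimal sketch of going from a text file to an array of URLs. It assumes the whole file fits in memory and reads it with fileread instead of textscan. The temporary file is created only so the sketch is self-contained; in practice you would point fname at your own file.

```matlab
% Write a small sample file (stand-in for your real input file)
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here');
fprintf(fid, '%s\n', 'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://');
fclose(fid);

% Read the entire file into one character vector
txt = fileread(fname);

% Same pattern as in the example above
pattern = ['((http|https|ftp|file)://|www\.|ftp\.)', ...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'];

% For a char input, 'match' returns a flat 1-by-N cell array of strings
urls = regexpi(txt, pattern, 'match');

delete(fname);   % clean up the sample file
disp(urls(:));
```

Because the input is a single character vector instead of a cell array of lines, urls is one flat cell array, which is convenient to loop over when downloading.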

Now, I fooled around a bit with the Java suggestion by pm89. As I suspected, it is an order of magnitude slower than just a regex, since you introduce another "layer of goo" to the code (in my timings, the Java version was about 40x slower, excluding the imports). Here's my version:

import java.net.URL;
import java.net.MalformedURLException;

str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'pre--URL garbage example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};


% Attempt to convert each whitespace-separated token into a URL.
for ii = 1:numel(str)
    cc = textscan(str{ii}, '%s');
    for jj = 1:numel(cc{1})
        try
            % No semicolon: display each successfully parsed URL
            url = java.net.URL(cc{1}{jj})

        catch ME
            % rethrow any non-URL related errors
            if isempty(regexpi(ME.message, 'MalformedURLException'))
                rethrow(ME);
            end

        end
    end
end

Results:

url =
    'http://www.example.com/index.php?query=test&otherStuf=info'
url =
    'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'

I'm not too familiar with java.net.URL, but apparently, it is also unable to find URLs without leading protocol or standard domain (e.g., example.com/path/to/page).

This snippet can undoubtedly be improved upon, but I would urge you to consider why you'd want to use this longer, inherently slower and far uglier solution :)
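If you want to check the speed difference on your own machine, a rough sketch with tic/toc could look like the following. It is not the exact benchmark behind the figure quoted above: the repetition count nReps is arbitrary, strsplit is used for tokenizing instead of textscan, and it assumes MATLAB is running with its JVM enabled.

```matlab
str = 'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here';
pattern = ['((http|https|ftp|file)://|www\.|ftp\.)', ...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'];
nReps = 1000;   % arbitrary repetition count (assumption)

% Time the regex approach
tic;
for k = 1:nReps
    m = regexpi(str, pattern, 'match');
end
tRegex = toc;

% Time the java.net.URL approach on whitespace-separated tokens
tokens = strsplit(str);
tic;
for k = 1:nReps
    for jj = 1:numel(tokens)
        try
            u = java.net.URL(tokens{jj});
        catch
            % token is not a URL; ignore it
        end
    end
end
tJava = toc;

fprintf('regex: %.4f s, java.net.URL: %.4f s\n', tRegex, tJava);
```

The absolute numbers will vary with MATLAB release and hardware, so treat the ratio as indicative only.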

+1: Nice study and good solution. However I don't think there would be a fast way to detect something without the standard scheme (like `example.com/path/to/page`) since the only way would need a connection to server and check the connectivity as [this answer](http://stackoverflow.com/a/1600333/1698972) suggests. — p8me, Jul 02 '13 at 14:09

score 3 · Answer 2 · edited May 23 '17 at 12:21

As I suspected, you could use java.net.URL, according to this answer.

To implement the same code in MATLAB:

First read the file into a string, using fileread for example:

str = fileread('Sample.txt');

Then split the text with respect to spaces, using strsplit:

spl_str = strsplit(str);

Finally use java.net.URL to detect the URLs:

for k = 1:length(spl_str)
    try
       url = java.net.URL(spl_str{k})
       % Store or save the URL contents here
    catch e
       % it's not a URL.
    end
end
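The loop above only displays each URL. A minimal sketch of storing them in a cell array instead might look like this; the sample string and the variable name urls are illustrative, and any token that java.net.URL rejects is simply skipped.

```matlab
% Sample text standing in for the contents of the file
str = 'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here';
spl_str = strsplit(str);

urls = {};   % will hold the detected URLs as char vectors
for k = 1:length(spl_str)
    try
        u = java.net.URL(spl_str{k});
        urls{end+1} = char(u);   % convert the Java object to char
    catch
        % not a URL; skip this token
    end
end

disp(urls);
```

Growing the cell array inside the loop is fine for small files; for very large inputs you could preallocate and trim afterwards.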

You can write the URL contents into a file using urlwrite. But first convert the URLs obtained from java.net.URL to char:

url = java.net.URL(spl_str{k});
urlwrite(char(url), 'test.html');
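With several URLs, writing everything to test.html would overwrite the file each time. The sketch below downloads each URL to its own file. The naming scheme page_1.html, page_2.html, ... is a hypothetical choice, and the urls list is a stand-in for whatever you extracted earlier. Requesting two outputs from urlwrite makes it report failure through a status flag instead of raising an error. Newer MATLAB releases document websave as the recommended replacement for urlwrite, so check which one your version supports.

```matlab
% Stand-in for the cell array of URLs extracted earlier
urls = {'http://www.example.com/index.php?query=test&otherStuf=info'};

files = cell(size(urls));   % target file name for each URL
ok = false(size(urls));     % true where the download succeeded

for k = 1:numel(urls)
    % Hypothetical naming scheme: one output file per URL
    files{k} = sprintf('page_%d.html', k);

    % status is 1 on success and 0 on failure (no error is thrown)
    [~, status] = urlwrite(urls{k}, files{k});
    ok(k) = logical(status);
end

fprintf('%d of %d downloads succeeded\n', nnz(ok), numel(urls));
```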

Hope it helps.
