
Is there a nice, clean way to find runs of 2-4 capital letters within a larger string in MATLAB? For example, let's say I have a string...

 stringy = 'I imagine I could FLY';

Is there a nice way to just extract the FLY portion of the string? Currently I'm using the upper() function to identify all the characters in the string that are upper case like this...

 for count = 1:length(stringy)
     if upper(stringy(count))==stringy(count)
          isupper(count)=1;
     else
          isupper(count)=0;
     end
 end

And then I'm just going through the binary vector and identifying where there are 2-4 1's in a row.

This method is working... but I'm wondering if there is a cleaner way to be doing this... thanks!!!

Flaminator
  • This may help...
    http://stackoverflow.com/questions/4598315/regex-to-match-only-uppercase-words-with-some-exceptions Good Luck.
    – Raathigesh Jan 24 '12 at 04:15

1 Answer

You can use regular expressions for this. The regular expression [A-Z]{2,4} will search for 2-4 capital letters in a string.

The corresponding MATLAB function is called regexp.

regexp(string,pattern) returns the starting index of every place in string where pattern matches.
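As a rough illustration of what the bare pattern does on its own, here is a small sketch. The variable names `starts` and `pieces` are just illustrative, and the `'match'` option asks regexp for the matched text rather than the indices:

```matlab
stringy = 'I imagine I could FLY and TOUCH THE SKY';

% Start index of every run of 2-4 capital letters
starts = regexp(stringy, '[A-Z]{2,4}');           % 19 27 33 37

% The matched text itself, as a cell array of strings
pieces = regexp(stringy, '[A-Z]{2,4}', 'match');  % {'FLY','TOUC','THE','SKY'}
```

Note that the five-letter word TOUCH still produces a match ('TOUC'), because nothing in the bare pattern says the run has to end after four letters. That is why the pattern usually needs some kind of boundary condition around it.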

For your pattern I have two suggestions:

  1. \<[A-Z]{2,4}\>. This searches for whole words that consist of 2-4 capital letters (so it doesn't grab TOUCH below):

    stringy = 'I imagine I could FLY and TOUCH THE SKY';
    regexp(stringy,'\<[A-Z]{2,4}\>') % returns 19, 33, 37 ('FLY','THE','SKY')
    

    (Edit: MATLAB uses \< and \> for word boundaries, not the more common \b.)

  2. If you have strings where case can be mixed within a word and you want to extract those, try (?<![A-Z])[A-Z]{2,4}(?![A-Z]) (which means "2-4 capital letters that are not immediately preceded or followed by another capital letter"):

    stringy = 'I image I could FLYandTouchTHEsky';
    % returns 17 and 28 ('FLY', 'THE')
    regexp(stringy,'(?<![A-Z])[A-Z]{2,4}(?![A-Z])') 
    
    % note '\<[A-Z]{2,4}\>' wouldn't match anything here since it looks for
    % *whole words* that consist of 2-4 capital letters only.
    % 'FLYandTouchTHEsky' doesn't satisfy this.
    

Pick the regex based on what behaviour you want to occur.
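If you want the capital-letter substrings themselves (and their lengths) rather than just the start indices, something along these lines should work. This is only a sketch: the variable names are made up for illustration, and it uses the lookaround pattern from suggestion 2, which does not rely on MATLAB's \< and \> syntax:

```matlab
stringy = 'I imagine I could FLY and TOUCH THE SKY';
pattern = '(?<![A-Z])[A-Z]{2,4}(?![A-Z])';

% Two outputs: where each match starts and where it ends
[startIdx, endIdx] = regexp(stringy, pattern);    % [19 33 37] and [21 35 39]

% 'match' returns the matched substrings as a cell array
words = regexp(stringy, pattern, 'match');        % {'FLY','THE','SKY'}

% Length of each run of capitals (2, 3 or 4)
runLengths = endIdx - startIdx + 1;               % [3 3 3]
```

TOUCH is skipped here because any 2-4 letter slice of it is always next to another capital letter.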

mathematical.coffee
  • this looks like it will work perfectly! thanks! one question... when I try the approach you have in point #1, i get an empty output. Why would that be? (I mean I used the two statements you did at the console) – Flaminator Jan 24 '12 at 04:34
  • Huh, it turns out Matlab uses `\<` and `\>` for word boundaries instead of `\b`, I'll update my answer (octave uses the more standard '\b'...) – mathematical.coffee Jan 24 '12 at 04:46
  • Perfect, thanks! So this approach will return the indices of the start of the capital letter sections, but will not indicate whether it is a string of 2,3 or 4 capital letters right? – Flaminator Jan 24 '12 at 05:09
  • Figured out my above clarification. If you request two outputs from regexp (i.e. [a b] = regexp(...)), the first indicates where each capital-letter string starts and the second indicates where it ends. – Flaminator Jan 24 '12 at 05:39