Get first character of each word and its position in a sentence/paragraph

Question

I am trying to create a map by taking the first character of each word and it's position in a sentence/paragraph. I am using regex pattern to achieve this. Regex is a costly operation. Are there are any ways to achieve this?

Regex way:

public static void getFirstChar(String paragraph) {
    Pattern pattern = Pattern.compile("(?<=\\b)[a-zA-Z]");
    Map newMap = new HashMap();

    Matcher fit = pattern.matcher(paragraph);
    while (fit.find()) {
        newMap.put((fit.group().toString().charAt(0)), fit.start());
    }
}

You wouldn’t need the look-behind assertion as `\b` already is an assertion that doesn’t consume any character. — Gumbo, May 11 '10 at 07:31
Don't use raw types in new code! http://stackoverflow.com/questions/2770321/what-is-a-raw-type-and-why-shouldnt-we-use-it/ — polygenelubricants, May 11 '10 at 07:41
"Regex is a costly operation" - did you profile, or is that your assumption? This particular regex shouldn't be too costly and is asymptotically the same as doing your own linear scan. — polygenelubricants, May 11 '10 at 07:44
Don't use raw types in new code! - This was just a sample code that I wrote to post it in the forum. Sure will take care. did you profile, or is that your assumption - The response time from the method using this operation was long. Hence tried for a better option. — Radhika, May 12 '10 at 06:38

bboe · Answer 1 · 2010-05-11T07:39:02.097

0

Python:

wmap = {}
prev = 0
for word in "the quick brown fox jumps over the lazy dog".split():
    wmap[word[0]] = prev
    prev += len(word) + 1

print wmap

If a letter appears more than once as the first letter of a word it'll map to the last position. For a list of all positions change wmap[word[0]] = prev to:

if word[0] in wmap:
    wmap[word[0]].append(prev)
else:
    wmap[word[0]] = [prev]

edited May 11 '10 at 07:39

answered May 11 '10 at 07:31

bboe

4,092
3
29
39

It should be pretty easy to translate this code to java. I'm pretty sure split is the same, and your use of map is the same. Can't make it too easy for ya :) – bboe May 11 '10 at 07:41
This `split` works for spaces. The original code finds each letter following a non-letter, which might be an apostrophe (it finds the `m` in `I'm`). – Christian Semrau May 11 '10 at 21:14

score 0 · Accepted Answer · answered May 11 '10 at 07:54

You can do your own linear scan if you really need to squeeze every bit of performance:

                 //0123456789012345678901
    String text = "Hello,my name is=Helen";
    Map<Character,Integer> map = new HashMap<Character,Integer>();

    boolean lastIsLetter = false;
    for (int i = 0; i < text.length(); i++) {
        char ch = text.charAt(i);
        boolean currIsLetter = Character.isLetter(ch);
        if (!lastIsLetter && currIsLetter) {
            map.put(ch, i);
        }
        lastIsLetter = currIsLetter;
    }

    System.out.println(map);
    // prints "{n=9, m=6, H=17, i=14}"

API links

Character.isLetter

@Radhika: consider accepting the answer (clicking on the green check on the left) if an answer is satisfactory. If you still need help with the question, explain your concerns and I'll try to address it. — polygenelubricants, May 12 '10 at 07:31

Get first character of each word and its position in a sentence/paragraph

2 Answers2

API links