How to extract character ngram from sentences? - python

Question

The following word2ngrams function extracts character 3grams from a word:

>>> x = 'foobar'
>>> n = 3
>>> [x[i:i+n] for i in range(len(x)-n+1)]
['foo', 'oob', 'oba', 'bar']

This post shows the character ngrams extraction for a single word, Quick implementation of character n-grams using python.

But what if i have sentences and i want to extract the character ngrams, is there a faster method other than iteratively call the word2ngram()?

What will be the regex version of achieving the same word2ngram and sent2ngram output? would it be faster?

I've tried:

import string, random, time
from itertools import chain

def word2ngrams(text, n=3):
  """ Convert word into character ngrams. """
  return [text[i:i+n] for i in range(len(text)-n+1)]

def sent2ngrams(text, n=3):
    return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    text = text.lower()
    return [text[i:i+n] for i in range(len(text)-n+1) if not " " in text[i:i+n]]

# Generate 10000 random strings of length 100.
sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

start = time.time()
x = [sent2ngrams(i) for i in sents]
print time.time() - start        

start = time.time()
y = [sent2ngrams_simple(i) for i in sents]
print time.time() - start        

print x==y

[out]:

0.0205280780792
0.0271739959717
True

EDITED

The regex method looks elegant but it performs slower than iteratively calling word2ngram():

import string, random, time, re
from itertools import chain

def word2ngrams(text, n=3):
  """ Convert word into character ngrams. """
  return [text[i:i+n] for i in range(len(text)-n+1)]

def sent2ngrams(text, n=3):
    return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    text = text.lower()
    return [text[i:i+n] for i in range(len(text)-n+1) if not " " in text[i:i+n]]

def sent2ngrams_regex(text, n=3):
    rgx = '(?=('+'\S'*n+'))'
    return re.findall(rgx,text)

# Generate 10000 random strings of length 100.
sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

start = time.time()
x = [sent2ngrams(i) for i in sents]
print time.time() - start        

start = time.time()
y = [sent2ngrams_simple(i) for i in sents]
print time.time() - start        

start = time.time()
z = [sent2ngrams_regex(i) for i in sents]
print time.time() - start  

print x==y==z

[out]:

0.0211708545685
0.0284190177917
0.0303599834442
True

score 6 · Accepted Answer · 2014-03-15T19:46:24.340

6

Why not just (?=(...))

edit Same thing, but not whitespace (?=(\S\S\S))
edit2 You can use just what you want as well. Ex. uses alphanum only (?=([^\W_]{3}))

Uses a lookahead to capture 3 characters. Then the engine bumps the position up 1 time each
match. Then captures next 3.

Result of foobar is
foo
oob
oba
bar

 # Compressed regex
 #  (?=(...))

 # Expanded regex
 (?=                   # Start Lookahead assertion
      (                     # Capture group 1 start
           .                     # dot - metachar, matches any character except newline
           .                     # dot - metachar
           .                     # dot - metachar
      )                     # Capture group 1 end
 )                     # End Lookahead assertion

edited Mar 15 '14 at 19:46

answered Mar 15 '14 at 19:19

pardon my noobiness, what is `(?=(...))`? can you give a working example? i've tried: `(?=('foobar'))` but got a syntax error. – alvas Mar 15 '14 at 19:21
Added some comments, `(?=(...))` is the regex. I don't know Python, but it should be used as a regular expression in match all (get an array output) context. – Mar 15 '14 at 19:28
`re.findall(r'(?=(...))','foobar')`, out: `['foo', 'oob', 'oba', 'bar']`. – alvas Mar 15 '14 at 19:30
1

Cool ... Is that what you are looking for? – Mar 15 '14 at 19:31
`[i for i in re.findall(r'(?=(...))','foobar like') if not " " in i]` – alvas Mar 15 '14 at 19:31
yeah, let me try to profile, it and see how much time i save =). Thanks for the neat regex trick. Is there a way to not check for the `" "`? – alvas Mar 15 '14 at 19:32
So, not whitespace then. Hang on a sec. – Mar 15 '14 at 19:33
2

Yeah, so same thing, no whitespace `(?=(\S\S\S))` – Mar 15 '14 at 19:35
@sin, it's a nice regex but it performs much slower than iterating through individual words =( – alvas Mar 15 '14 at 20:47
1

@alvas - A slowdown could be `rgx =` should be compiled only once, not for each sentence. It should be pre-compiled before the iteration. And you could also improve the speed of the regex by %10-15 if you actively move the match position. i.e, `/(?=(\S\S\S))./` Add the Dot-All modifier (same as `/(?=(\S\S\S))[\S\s]/` or `/(?s)(?=(\S\S\S))./`). – Mar 16 '14 at 21:13
It solved my problem; without even getting near to your approach! Thanks! – Akbar Hussein May 05 '20 at 04:40

How to extract character ngram from sentences? - python

1 Answers1

Linked