6

I am currently developing a web application that fetches a Twitter stream, and I am trying to write my own natural language processing for it.

Since my data comes from Twitter (limited to 140 characters), many words are shortened or, as in this case, written without the space between them.

For example:

"Hi, my name is Bob. I m 19yo and 170cm tall"

Should be tokenized to:

- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall

Notice that 19 and yo in 19yo have no space between them. I use it mostly for extracting numbers with their units.

Simply put, I need a way to 'explode' each token that contains a number into chunks of digits and chunks of letters, with no delimiter between them.

'123abc' will be ['123', 'abc']

'abc123' will be ['abc', '123']

'abc123xyz' will be ['abc', '123', 'xyz']

and so on.

What is the best way to achieve it in PHP?


I found something close to it, but it's in C# and specifically for day/month splitting: "How do I split a string in C# based on letters and numbers".

akhy

2 Answers

9

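You can use preg_split:

$string = "Hi, my name is Bob. I m 19yo and 170cm tall";
// Split on whitespace (with an optional leading comma),
// or on a zero-width boundary between a letter and a digit (either order).
$parts = preg_split("/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i", $string);
var_dump($parts);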

When matching the digit-letter boundary, the regular expression match must be zero-width: the characters on either side must not be part of the match, because preg_split discards whatever the delimiter pattern matches. Zero-width lookarounds are useful for exactly this.

http://codepad.org/i4Y6r6VS
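
To make the zero-width point concrete, here is a minimal sketch that applies only the lookaround part of the pattern to single tokens and compares it with a consuming pattern. The helper name `explode_alnum` is hypothetical (it is not from the answer above), and the sketch assumes ASCII letters only.

```php
<?php
// Hypothetical helper: split one token into runs of letters and runs of
// digits, using only the zero-width lookarounds from the pattern above.
function explode_alnum(string $token): array
{
    return preg_split('/(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])/i', $token);
}

// A consuming pattern treats "b1" as the delimiter and throws it away.
$consuming = preg_split('/[a-z]\d/i', 'ab12');   // ['a', '2']

// The zero-width version matches between characters, so nothing is lost.
$zeroWidth = explode_alnum('ab12');              // ['ab', '12']

print_r($consuming);
print_r($zeroWidth);
print_r(explode_alnum('123abc'));    // ['123', 'abc']
print_r(explode_alnum('abc123'));    // ['abc', '123']
print_r(explode_alnum('abc123xyz')); // ['abc', '123', 'xyz']
```

Tokens that contain no letter-digit boundary come back as a one-element array, so the helper can be applied to every token after splitting on whitespace.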

d_inevitable
  • Sorry, I obviously hadn't tested it. I didn't know codepad.org existed. Will make use of it now. – d_inevitable Apr 16 '12 at 20:04
  • @d_inevitable I don't really get your later explanation. Does it mean there are some cases your regex cannot handle correctly? – akhy Apr 16 '12 at 20:27
  • 2
    No, I am just explaining what the previous problem was with something like `[a-z]\d` as the letter-digit boundary. That expression would produce `['a', '2']` from `'ab12'`, because `b1` would be matched as the delimiter and thus discarded. – d_inevitable Apr 16 '12 at 20:32
  • Also remember cases like "Download Concrete5 CMS". – Xeoncross Feb 04 '13 at 16:09
1

How about this:

Extract the numbers from the string with a regular expression and store them in an array, then replace each number in the string with a special placeholder that holds its position. After parsing the string, which now consists only of placeholders and ordinary characters, put the numbers from the array back into their reserved places.

Just an idea, but IMHO it might work for you.

EDIT: Try running this short code; hopefully you will see my point in the output. (This code doesn't work on codepad, I don't know why.)

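<?php
$str = "Hi, my name is Bob. I m 19yo and 170cm tall";
// Collect every run of digits, in order of appearance.
preg_match_all("#\d+#", $str, $matches);
// Replace each run of digits with a placeholder that marks its position.
$str = preg_replace("!\d+!", "#SPEC#", $str);

print_r($matches[0]); // the extracted numbers: 19 and 170
print $str;           // the string with #SPEC# where the numbers were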
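
The code above stops after the extraction step. A rough sketch of the remaining steps (tokenize the masked string, then feed the numbers back) could look like the following. It assumes the placeholder `#SPEC#` never occurs in the input, pads the placeholder with spaces so it separates from adjacent letters, and treats any run of ASCII letters as a token; these choices are illustrative rather than part of the original idea.

```php
<?php
$str = "Hi, my name is Bob. I m 19yo and 170cm tall";

// Pull the numbers out and leave a space-padded placeholder at each position.
preg_match_all('/\d+/', $str, $matches);
$numbers = $matches[0];
$masked  = preg_replace('/\d+/', ' #SPEC# ', $str);

// Tokenize: a token is either the placeholder or a run of letters.
preg_match_all('/#SPEC#|[a-z]+/i', $masked, $m);

// Put the numbers back into their reserved places, in order of appearance.
$tokens = array_map(function ($t) use (&$numbers) {
    return $t === '#SPEC#' ? array_shift($numbers) : strtolower($t);
}, $m[0]);

print_r($tokens);
```

Punctuation is dropped by the tokenizing pattern, and no stop-word filtering is attempted, so words such as "is" and "and" stay in the result.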
xholicka
  • Interesting, but also a little confusing to me. Could you give me some additional explanation? – akhy Apr 16 '12 at 20:13
  • Answer edited, check it out. If you need more explanation, just ask; I'll support the whole solution after I get my sleep ;) – xholicka Apr 16 '12 at 20:32