Currently I am developing a web application to fetch Twitter stream and trying to create a natural language processing by my own.
Since my data is from Twitter (limited by 140 characters) there are many words shortened, or on this case, omitted space.
For example:
"Hi, my name is Bob. I m 19yo and 170cm tall"
Should be tokenized to:
- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall
Notice that 19
and yo
in 19yo
have no space between them. I use it mostly for extracting numbers with their units.
Simply, what I need is a way to 'explode' each tokens that has number in it by chunk of numbers or letters without delimiter.
'123abc'
will be ['123', 'abc']
'abc123'
will be ['abc', '123']
'abc123xyz'
will be ['abc', '123', 'xyz']
and so on.
What is the best way to achieve it in PHP?
I found something close to it, but it's C# and spesifically for day/month splitting. How do I split a string in C# based on letters and numbers