
I am matching each word against the identical words in a paragraph.

Update 1: I realise that just accepting the punctuation I need does not solve this issue.

For example, 'hello-' and 'hello' are considered separate words.

Is there a way to remove punctuation before and after a word, as well as standalone punctuation? Punctuation should only be allowed within a word.

$string="_ - – hello’ hello' hello, hello- world. he,llo hello-world hello_world hel-lo-world hello9world"; 

The output should be

hello hello hello hello world he,llo hello-world hello_world hel-lo-world hello9world

Only words, or punctuation within a word, should remain.

Update 2: If only words and punctuation within words are allowed, decimal numbers become a problem.

1.0 is still fine, but because punctuation before and after a word is removed, .1 becomes 1 instead of 0.1.

Update 3: When accepting punctuation within words, substrings that start or end with a letter or a number become a problem: 20-year-old becomes '20-' and 'year-old'.

Thanks mickmackusa.

kiki

1 Answer


Pattern: /[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu (Pattern Demo)

This pattern demands that all matching substrings start with a letter or a number. The substrings may contain a punctuation character (any of the ones in the character class [-_’',.]) but it must be immediately followed by one or more letters or numbers. The * means zero or more of the preceding parenthetical expression, so substrings can be valid whether they contain a non-alpha-numeric character or not.

This pattern will not match a substring with two consecutive non-alpha-numeric characters as one match. For example, 20--what will not be returned as 20--what; it will be matched as 20 and what.
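A minimal sketch of this behaviour, using the same pattern. The helper name `extractWords` is hypothetical and only there for illustration:

```php
<?php
// Hypothetical helper: returns every substring matched by the pattern above.
function extractWords(string $text): array
{
    preg_match_all("/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu", $text, $matches);
    return $matches[0];
}

// A single punctuation character between alphanumerics stays inside the match.
echo implode(' | ', extractWords("20-year-old")), "\n";  // 20-year-old
// Two consecutive punctuation characters split the substring into two matches.
echo implode(' | ', extractWords("20--what")), "\n";     // 20 | what
```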

*If you want to allow ANY non-whitespace character in the middle of the string, you can use this: /[a-z\d]+(?:\S[a-z\d]+)*/iu
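To see the difference between the two patterns, here is a small comparison. The sample string and variable names are made up for illustration:

```php
<?php
$strict = "/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu";  // only the listed punctuation inside a word
$loose  = "/[a-z\d]+(?:\S[a-z\d]+)*/iu";         // any non-whitespace character inside a word

$text = "user@example rock&roll hello-world";

preg_match_all($strict, $text, $a);
preg_match_all($loose, $text, $b);

echo implode(' ', $a[0]), "\n";  // user example rock roll hello-world
echo implode(' ', $b[0]), "\n";  // user@example rock&roll hello-world
```

Both patterns still drop leading and trailing punctuation, because every match must start and end with a letter or a number.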

The i flag allows [a-z] to match uppercase occurrences as well.
The u flag makes the pattern treat multibyte (Unicode) characters like ’ as single characters.
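A rough illustration of what each flag contributes, assuming the script file is saved as UTF-8. The sample string and variable names are made up:

```php
<?php
$text = "Hello WORLD hel’lo";

// Without i: [a-z] only matches lowercase, so uppercase letters are skipped.
preg_match_all("/[a-z\d]+(?:[-_’',.][a-z\d]+)*/u", $text, $noI);
// Without u: ’ is seen as three separate bytes, so it cannot join hel and lo.
preg_match_all("/[a-z\d]+(?:[-_’',.][a-z\d]+)*/i", $text, $noU);
// With both flags, as in the answer.
preg_match_all("/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu", $text, $both);

echo implode(' ', $noI[0]), "\n";   // ello hel’lo
echo implode(' ', $noU[0]), "\n";   // Hello WORLD hel lo
echo implode(' ', $both[0]), "\n";  // Hello WORLD hel’lo
```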

PHP Code: (Demo)

$string="_ - – hel’lo’ hel'lo' .1 1.0 1. hello, hello- world. he,llo hello-world hello_world hel-lo-world hello9world -20 20- 20-year -20year- -20-year- 20-year-old 20-yearold 20year-old 20-year-old-old 20-20-year-20-old-";
echo preg_match_all("/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu",$string,$out)?implode(' ',$out[0]):'fail';

Output:

hel’lo hel'lo 1 1.0 1 hello hello world he,llo hello-world hello_world hel-lo-world hello9world 20 20 20-year 20year 20-year 20-year-old 20-yearold 20year-old 20-year-old-old 20-20-year-20-old
mickmackusa
  • Here is a new pattern. https://regex101.com/r/huvfC1/10 I've got to shuttle my kids around so I'll be away from my computer for a while. Let me know how this works for you. – mickmackusa Nov 13 '17 at 05:54
  • Ah yes. Notice in my demo that I changed the pattern delimiter to `~` – mickmackusa Nov 13 '17 at 06:25
  • I am home for a few minutes, so I can explain with a bit more detail... The default / most popular pattern delimiter is `/`. However, if you use `/` in your pattern it must be escaped by writing `\/`. Changing the delimiter to `~` avoids escaping `/`'s in the pattern, but then requires `~` to be escaped as `\~`. So, actually, there is no advantage in changing the delimiter. Now that I understand the huge list of punctuation in your pattern, I must change my advice. Keep the delimiter as `/`, just escape all forward slashes that occur inside the pattern like: `\/` – mickmackusa Nov 13 '17 at 06:54
  • @kiki http://sandbox.onlinephpfunctions.com/code/b3dde568977650d38a02c7f8a038bec04b9ab630 and https://regex101.com/r/M97FkV/26 – mickmackusa Nov 13 '17 at 07:14
  • http://sandbox.onlinephpfunctions.com/code/00b846eb2bba8591ac7e0961ed7ff00c99f24e3d (explanations in the demo) – mickmackusa Nov 13 '17 at 07:50
  • @kiki Here's some more regex education to soak up: http://sandbox.onlinephpfunctions.com/code/9b0b1c2959b81f307bd616e612eae8f0c92fe226 – mickmackusa Nov 13 '17 at 12:37
  • @mick Nice solution. I've pulled my answer and upvoted yours since the question was updated after I answered. Given the userbase for SO I wasn't aware that users were looking for "professional grade methods" given the level of experience in questions. Usually users are not ready to understand such solutions *as you know.* In my experience providing wildly complicated regex patterns doesn't teach people regex, but rather how to come back and ask for more help. Don't let the perfect be the enemy of the good. – AbsoluteƵERØ Nov 13 '17 at 18:42
  • @AbsoluteƵERØ I think you are underestimating the vast StackOverflow audience. Devs of all levels and backgrounds come here for wisdom. It is foolish to assume that future readers _are not ready_ for this wisdom. In fact, it is a disservice to all (askers, answerers, and future researchers) to provide anything less than your very best methods and explanations. We should all aspire to post perfect work and be content when it is merely great. If complex regex is the best method, then post it and explain it in great detail. Never throttle your best ideas. – mickmackusa Nov 14 '17 at 05:56
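The delimiter advice from the comments above can be sketched roughly as follows. The shortened character class and sample string are assumptions for illustration, not the asker's full punctuation list:

```php
<?php
$text = "and/or hello~world";

// Delimiter /: a literal forward slash inside the pattern is written \/
preg_match_all("/[a-z\d]+(?:[\/~][a-z\d]+)*/i", $text, $slashDelimited);
// Delimiter ~: the slash needs no escape, but a literal tilde is written \~
preg_match_all("~[a-z\d]+(?:[/\~][a-z\d]+)*~i", $text, $tildeDelimited);

echo implode(' ', $slashDelimited[0]), "\n";  // and/or hello~world
echo implode(' ', $tildeDelimited[0]), "\n";  // and/or hello~world
```

Either delimiter works; whichever one is chosen is the character that has to be escaped inside the pattern.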