I am working with XML in an Android app, and the text sometimes comes through with sentences run together, with no space after the period.

Like: First sentence.Another sentence

I know I need to match [a-z] (lowercase letters), [A-Z] (uppercase letters), and digits ([0-9]?) before and after the period, and then add a space after the period.

Maybe something like:

myString = myString.replaceAll("(\\p{Ll})(\\p{Lu})", "$1 $2");

My searches and efforts have gotten me nowhere so far, so any and all help is welcome. Thanks.

Dustin
  • 1
    Couldn't you come up with a better title than `I can not find this regex`? – devnull Feb 24 '14 at 07:29
  • 1
    Your title sounds like you've [lost your regex, and you need help finding it](https://xkcd.com/1313/). – user2357112 Feb 24 '14 at 07:31
  • Never parse XML with regex. XML is not a regular language. Use well-known XML parsers instead. See this question: http://stackoverflow.com/questions/8577060/why-is-it-such-a-bad-idea-to-parse-xml-with-regex – Madusudanan Feb 24 '14 at 07:33
  • By the time I am making edits, the XML has already been turned into a well-formatted string. – Dustin Feb 24 '14 at 07:34
  • At what point are these sentences stuck together without a space? Does the XML itself have sentences joined improperly, with no spaces or tags between them? – user2357112 Feb 24 '14 at 07:35
  • I have no idea where the problem occurs. I am editing a string obtained from an RSS XML feed that mainly provides info on the web, but for some reason, when I pull it into Android, it comes through missing spaces like these. – Dustin Feb 24 '14 at 07:38

1 Answer

You were almost there; you just forgot to match the dot:

myString = myString.replaceAll("(\\p{Ll})\\.(\\p{Lu})", "$1. $2");

And since you're not actually doing anything with the letter before and after the dot, you can speed things up a bit by using lookaround assertions:

myString = myString.replaceAll("(?<=\\p{Ll})\\.(?=\\p{Lu})", ". ");
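
To see the two patterns above side by side, here is a minimal sketch that wraps each one in a method and runs it on the example from the question. The class and method names (`LookaroundDemo`, `withGroups`, `withLookarounds`) are made up for illustration; only the regexes come from the answer.

```java
class LookaroundDemo {
    // Capturing-group version: the letters are matched, captured,
    // and written back via $1 and $2.
    static String withGroups(String s) {
        return s.replaceAll("(\\p{Ll})\\.(\\p{Lu})", "$1. $2");
    }

    // Lookaround version: only the dot itself is matched and replaced;
    // the surrounding letters are checked but not consumed.
    static String withLookarounds(String s) {
        return s.replaceAll("(?<=\\p{Ll})\\.(?=\\p{Lu})", ". ");
    }

    public static void main(String[] args) {
        String input = "First sentence.Another sentence";
        System.out.println(withGroups(input));
        System.out.println(withLookarounds(input));
        // Dots that are not between a lowercase and an uppercase letter
        // are left alone by both versions.
        System.out.println(withLookarounds("C.I.A. agents"));
    }
}
```

Both methods should turn the sample input into `First sentence. Another sentence`, and neither should touch `C.I.A. agents`, since no dot there follows a lowercase letter.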
Tim Pietzcker
  • Of course, now we're putting extra spaces into acronyms written with periods. We could try to tell whether we're looking at an acronym, but then we run into *more* edge cases. Natural language correction is messy. – user2357112 Feb 24 '14 at 07:39
  • Yes, but this still misses the fact that there could be a number, a lowercase letter, or an uppercase letter before and after the period. – Dustin Feb 24 '14 at 07:40
  • I know this is a messy thing to edit... but there will be very very few of these cases I think – Dustin Feb 24 '14 at 07:41
  • If you also want to replace dots after uppercase letters and digits, just use `[\\p{L}\\d]` instead of `\\p{Ll}`, but then you'd also replace `C.I.A.` with `C. I. A.`. – Tim Pietzcker Feb 24 '14 at 08:12
  • @TimPietzcker: Didn't see that the lookarounds were specifically lowercase and uppercase. It means we're missing *different* weird edge cases, but C.I.A. is currently fine. – user2357112 Feb 24 '14 at 08:13
  • Hi, can I optimize this? I am calling this like #1 in my web application for generating SQL statements. – shareef Mar 25 '18 at 14:18
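
The trade-off discussed in the comments above can be sketched as follows. `widenedBefore` uses the `[\\p{L}\\d]` lookbehind suggested in the comments; `widenedBothSides` is an assumption about what the asker wants (letters or digits on both sides of the dot) and is not taken from the thread. The class and method names are hypothetical.

```java
class WidenedClassDemo {
    // Variant from the comments: any letter or digit before the dot,
    // but still an uppercase letter after it.
    static String widenedBefore(String s) {
        return s.replaceAll("(?<=[\\p{L}\\d])\\.(?=\\p{Lu})", ". ");
    }

    // Assumed variant: any letter or digit on both sides of the dot.
    static String widenedBothSides(String s) {
        return s.replaceAll("(?<=[\\p{L}\\d])\\.(?=[\\p{L}\\d])", ". ");
    }

    public static void main(String[] args) {
        // A digit before the dot is now handled.
        System.out.println(widenedBefore("Chapter 2.Next"));
        // Acronyms written with periods get split up.
        System.out.println(widenedBefore("C.I.A. report"));
        // Widening both sides also splits decimal numbers.
        System.out.println(widenedBothSides("Version 3.5 shipped"));
    }
}
```

The wider the character classes, the more ordinary text (acronyms, decimals, and probably things like domain names) gets caught, so it may be worth checking a sample of real feed data before choosing a variant.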