1

I have a string. The input is always an English sentence and its translation in another language, on one line, with no delimiter between them.

String str = "2019雨降るしですね。It rains 2019."; 

How can I separate it into two?

2019雨降るしですね。

It rains 2019.

I tried this, but it failed:

    String aString = "2019/1/1,なにげない日々。2019/1/1 is a simple day.";
    // Matches one or more consecutive Hiragana characters
    Pattern pat = Pattern.compile("([\\p{InHiragana}]+)");
    Matcher m = pat.matcher(aString);
    System.out.println(m.find()); // true
    String firstHour = m.group(0); // only the first Hiragana run
    System.out.println(firstHour); // なにげない
Cœur
manhon

2 Answers

0

\W matches any character outside the set [a-zA-Z_0-9], so it can be used to match the non-English part.

A quick solution for your first case: (\\d{4})(\\W+)(\\s*)(.*)
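As a minimal sketch of how that pattern could be applied to the first example: the class name `RegexSplitSketch` and the `split` helper are hypothetical names chosen for illustration. It assumes the input starts with a four-digit number followed directly by non-ASCII text, which holds for the first sample string but not for the second one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSplitSketch {
    // Pattern from the answer: four digits, a run of non-word characters,
    // optional whitespace, then the rest of the line.
    private static final Pattern PATTERN =
            Pattern.compile("(\\d{4})(\\W+)(\\s*)(.*)");

    /** Returns {translation, english}, or null if the input does not fit the pattern. */
    static String[] split(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // Groups 1 and 2 together form the non-English part; group 4 is the English part.
        return new String[] { m.group(1) + m.group(2), m.group(4) };
    }

    public static void main(String[] args) {
        String[] parts = split("2019雨降るしですね。It rains 2019.");
        System.out.println(parts[0]); // 2019雨降るしですね。
        System.out.println(parts[1]); // It rains 2019.
    }
}
```

This works here because, without the `UNICODE_CHARACTER_CLASS` flag, Java treats every Japanese character (including 。) as a non-word character, so `\W+` stops at the first ASCII letter.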

The_Cute_Hedgehog
  • Is there a more generic way? The input is an English sentence and its translation in another language, on one line. Thanks a lot. – manhon Dec 28 '19 at 16:10
0

I'd recommend improving the format you receive the data in instead, as this problem cannot be solved with 100% accuracy. That being said, here's one approach that will work for most cases:

  1. Split the string into words (e.g. .split(" ")).
  2. For the first item in the array:
    1. Check whether the word is all English letters (if it's all numbers, move on to the next word).
    2. Store this result.
  3. For every other item in the array:
    1. Check whether the word is all English letters.
    2. If it is, and the previous word was not all English letters, you have your breakpoint.
  4. Join the words before the breakpoint into one string, and the words from the breakpoint onward into another.

You'll now have two strings: one non-English, one English. Expect to do a lot of testing, and you will likely need to improve how numbers are handled (for example, by splitting on numbers with a regex), but this is a start.
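The steps above could be sketched roughly as follows. The class `WordSplitSketch` and the helpers `isEnglishWord` and `split` are hypothetical names, and the definition of an "English word" (ASCII letters, optionally with surrounding ASCII punctuation) is an assumption made for this sketch, not something the answer specifies. The sample input has spaces added between the parts so that splitting on spaces has something to work with.

```java
import java.util.Arrays;

public class WordSplitSketch {
    // Assumption: an "English word" is ASCII letters (plus ' or -),
    // optionally wrapped in ASCII punctuation such as quotes or a full stop.
    static boolean isEnglishWord(String word) {
        return word.matches("\\p{Punct}*[A-Za-z][A-Za-z'-]*\\p{Punct}*");
    }

    /** Returns {nonEnglish, english}, or null when no breakpoint is found. */
    static String[] split(String input) {
        String[] words = input.split(" ");
        boolean previousWasNonEnglish = false;
        for (int i = 0; i < words.length; i++) {
            if (words[i].matches("\\d+")) {
                continue; // all digits: tells us nothing about the language
            }
            boolean english = isEnglishWord(words[i]);
            if (english && previousWasNonEnglish) {
                // Breakpoint: words before index i are the non-English part.
                return new String[] {
                    String.join(" ", Arrays.copyOfRange(words, 0, i)),
                    String.join(" ", Arrays.copyOfRange(words, i, words.length))
                };
            }
            previousWasNonEnglish = !english;
        }
        return null;
    }

    public static void main(String[] args) {
        String[] parts = split("2019 雨降るしですね。 It rains 2019.");
        System.out.println(parts[0]); // 2019 雨降るしですね。
        System.out.println(parts[1]); // It rains 2019.
    }
}
```

Note that on the question's original string, where the Japanese text runs straight into "It" with no space, the first "word" is "2019雨降るしですね。It", so this sketch splits one word late ("rains 2019." becomes the English part). That is the kind of case where further testing and refinement would be needed.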

Jake Lee