I would like to know if there is an easy way to extract the first date encountered in a String in Java.

My program will analyse a lot of String texts, in different languages. These Strings can contain a date. Because of the languages (and the different sources), I have an awful lot of formats to take into consideration.

I first thought about regex, writing one regex for each format I could find... But there are an awful lot of them, for example "Month (d)d, yyyy" or "mm/dd/yyyy" or "dd-mon-yyyy"...

So I wanted to know if there is an easier way to extract a date from a String, maybe by using DateFormat, so that I can convert the found date to "dd/mm/yyyy".

Thank you for your help. ^^

Malik
  • You can use `SimpleDateFormat` to get a Date from a String, but the string must FULLY match the format. You can get the date from "11.11.2011" with the "dd.MM.yyyy" format (capital `MM` is the month; lowercase `mm` means minutes), but you can't get it from "the date is: 11.11.2011". – TEXHIK Apr 13 '15 at 10:33
  • It's more about data mining. – Alex Salauyou Apr 13 '15 at 10:37
  • This cannot be done, or at least not correctly, unless you know the possible date formats in advance and there is no ambiguity among them. Example: given the input `03/04/2012`, is it `3rd April` or `4th March`? – Kent Apr 13 '15 at 10:37
  • If you know all the patterns, try this [http://stackoverflow.com/questions/4024544/how-to-parse-dates-in-multiple-formats-using-simpledateformat]; otherwise there is no way. – Anjula Ranasinghe Apr 13 '15 at 10:38
  • @Kent Well, I have listed many possibilities: `month (d)d, yyyy`, `dd-mon-yyyy`, `mm/dd/yy`, `mm/dd/yyyy`, `dd/mm/yyyy`, `day, month dd, yyyy`... I thought about using regex and then converting with DateFormat. The 03/04 problem obviously cannot be solved; I am thinking about choosing one or the other, with a 50% probability of success. ^^ – Malik Apr 13 '15 at 10:43
  • @Malik in case of uncertainty you first need to analyze the whole text for heuristics that can help you. E.g. if 12/25 occurs but 25/12 never does, then the format is more likely to be mm/dd. – Alex Salauyou Apr 13 '15 at 10:45
  • @SashaSalauyou Sure, I can also know for certain whether it's dd/mm or mm/dd when I find that the first or the second member is > 12, which is then obviously the day. ^^ – Malik Apr 13 '15 at 10:51
  • @Malik you see, your problem is more interesting and complicated than just calling `SimpleDateFormat#parse`. If you succeed, it will be a great exercise for you as a programmer. Good luck! – Alex Salauyou Apr 13 '15 at 10:54
  • @Malik it will still be recognized as a date, and you can apply further logic to it. I'd recommend creating a list of valid patterns and iterating over it word by word in the text. Then, once you get the date, derive the proper format (mm/dd or dd/mm) from other attributes, such as the language, or maybe there are more dates within that text to indicate it... – Palcente Apr 13 '15 at 10:57
  • Thank you Sasha, I'll try my best. :) @Palcente I think that'll improve the quality obviously, thanks for the advice ! – Malik Apr 13 '15 at 13:13
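
The heuristic discussed in the comments above (a component greater than 12 can only be the day) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `DayMonthOrderGuesser`, its `Order` enum, and the `guess` method are hypothetical names, and only slash-separated numeric dates are considered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: votes on whether numeric dates in a text are
// day-first (dd/mm) or month-first (mm/dd).
class DayMonthOrderGuesser {

    enum Order { DAY_FIRST, MONTH_FIRST, UNKNOWN }

    // Assumption: only "n/n/yy" and "n/n/yyyy" shaped dates are inspected.
    private static final Pattern NUMERIC_DATE =
            Pattern.compile("\\b(\\d{1,2})/(\\d{1,2})/(\\d{4}|\\d{2})\\b");

    static Order guess(String text) {
        int dayFirstVotes = 0;
        int monthFirstVotes = 0;
        Matcher m = NUMERIC_DATE.matcher(text);
        while (m.find()) {
            int first = Integer.parseInt(m.group(1));
            int second = Integer.parseInt(m.group(2));
            // A value in 13..31 can only be a day, so the other one is the month.
            if (isDayOnly(first) && isMonth(second)) {
                dayFirstVotes++;
            } else if (isDayOnly(second) && isMonth(first)) {
                monthFirstVotes++;
            }
            // Dates such as 03/04/2012 stay ambiguous and cast no vote.
        }
        if (dayFirstVotes > monthFirstVotes) {
            return Order.DAY_FIRST;
        }
        if (monthFirstVotes > dayFirstVotes) {
            return Order.MONTH_FIRST;
        }
        return Order.UNKNOWN;
    }

    private static boolean isDayOnly(int n) {
        return n > 12 && n <= 31;
    }

    private static boolean isMonth(int n) {
        return n >= 1 && n <= 12;
    }
}
```

When the result is `UNKNOWN`, the text itself gives no evidence, so another signal (such as the language of the text, as suggested above) would still be needed.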

1 Answer

I think the best solution is to use a regex, but obviously you have to know all the possible patterns.
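
As a minimal sketch of that idea, assuming only three of the formats from the question, English month names, and that slash dates are month-first: each regex locates a candidate substring, and a matching `java.time` formatter validates and parses it. The class `FirstDateExtractor` and its methods are hypothetical names used for illustration, not an existing library API.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the earliest parseable date in a text.
class FirstDateExtractor {

    // Regex that locates a candidate, paired with the formatter that parses it.
    private static final Map<Pattern, DateTimeFormatter> FORMATS = new LinkedHashMap<>();

    private static final DateTimeFormatter OUTPUT = DateTimeFormatter.ofPattern("dd/MM/uuuu");

    static {
        // "April 13, 2015" (assumption: English month names only)
        FORMATS.put(Pattern.compile("\\b\\p{L}+ \\d{1,2}, \\d{4}\\b"), formatter("MMMM d, uuuu"));
        // "13-Apr-2015"
        FORMATS.put(Pattern.compile("\\b\\d{1,2}-\\p{L}{3}-\\d{4}\\b"), formatter("d-MMM-uuuu"));
        // "04/13/2015" (assumption: month-first)
        FORMATS.put(Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b"), formatter("M/d/uuuu"));
    }

    private static DateTimeFormatter formatter(String pattern) {
        return new DateTimeFormatterBuilder()
                .parseCaseInsensitive()
                .appendPattern(pattern)
                .toFormatter(Locale.ENGLISH)
                // STRICT rejects impossible dates such as February 30.
                .withResolverStyle(ResolverStyle.STRICT);
    }

    static Optional<LocalDate> firstDate(String text) {
        LocalDate best = null;
        int bestStart = Integer.MAX_VALUE;
        for (Map.Entry<Pattern, DateTimeFormatter> entry : FORMATS.entrySet()) {
            Matcher m = entry.getKey().matcher(text);
            // Only candidates that start before the current best are interesting.
            while (m.find() && m.start() < bestStart) {
                try {
                    best = LocalDate.parse(m.group(), entry.getValue());
                    bestStart = m.start();
                    break;
                } catch (DateTimeParseException ignored) {
                    // Looked like a date but is not a valid one; keep scanning.
                }
            }
        }
        return Optional.ofNullable(best);
    }

    // Converts a found date to the "dd/mm/yyyy" form asked for in the question.
    static String toDayMonthYear(LocalDate date) {
        return date.format(OUTPUT);
    }
}
```

For instance, `FirstDateExtractor.firstDate("Filed on April 13, 2015").map(FirstDateExtractor::toDayMonthYear)` yields an `Optional` containing `13/04/2015`. Supporting other languages or formats would mean adding more regex and formatter pairs, which is exactly the maintenance cost described in the question.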

A (possible) way to do this is by means of machine learning: you can provide a set of representative examples and let the algorithm find the patterns for you.

Your problem is really similar to the one addressed in this article. You can try to use this webapp to find a good regular expression for you.

The main problem is that you have to provide significant examples. I hope this will help you!

mimmuz