
Put simply, this is what I am trying to do (I want to use jsoup):

    1. Pass a single URL to parse
    2. Search for date(s) mentioned in the contents of the web page
    3. Extract at least one date from each page's contents
    4. Convert that date into a standard format

So, for point #1, this is what I have now:

String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
Document document = Jsoup.connect(url).get();

Here I want to understand what a "Document" actually is: has the HTML already been parsed at this point, does that hold for any type of web page, or is it something else?

Then, for point #2, this is what I have now:

Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = document.getElementsMatchingOwnText(p);

Here I am trying to match a date regex against the page to find dates and store them in a string for later use (point #3), but I am sure I am nowhere near it and need help here.

I have already done point #4.

So could anyone help me understand this and point me in the right direction for achieving the four points above?

Thanks in advance!

Update: here is what I want:
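Note that getElementsMatchingOwnText returns elements, not the matched strings. One way to get the dates themselves is to run the same pattern over the page text with java.util.regex.Matcher. The following is only a minimal sketch under some assumptions: the class name DateTextSearch and the sample text are hypothetical, and the text is assumed to have been obtained already, for example from document.text().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects every substring of a text that matches a date pattern.
final class DateTextSearch {

    // Same yyyy-mm-dd pattern as above; no flags are needed for digits and dashes.
    static final Pattern ISO_LIKE = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d");

    static List<String> findDates(String text) {
        List<String> dates = new ArrayList<>();
        Matcher m = ISO_LIKE.matcher(text);
        while (m.find()) {          // each call to find() advances to the next match
            dates.add(m.group());   // group() is the matched substring itself
        }
        return dates;
    }

    public static void main(String[] args) {
        // Stand-in for the page text; with jsoup this could come from document.text().
        String pageText = "asked 2015-01-26 by someone, edited 2017-01-31 again";
        System.out.println(findDates(pageText)); // prints [2015-01-26, 2017-01-31]
    }
}
```

The pattern only checks the shape of the text, so a string such as 2017-19-39 would still be collected; validating the value is a separate step.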

import java.io.IOException;
import java.util.regex.Pattern;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DateSearch {
    public static void main(String[] args) {
        try {
            // Send a browser-like User-Agent so the server treats the request as coming from a browser, not a bot
            final String USER_AGENT =
                    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";

            // The single URL I want to parse
            String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";

            // Create a jsoup Connection to the URL with the User-Agent set
            Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);

            // Retrieve the parsed document
            Document htmlDocument = connection.get();

            /* At this point I have a parsed document of the page. Is it in plain-text format?
             * If not, in what format is it stored in the variable 'htmlDocument'?
             */

            /* If 'htmlDocument' holds the text of the web page,
             * why do I need elements to find dates? Dates can be ordinary text in a web page,
             * so how would I find an element tag for them?
             * For example, if I wanted to collect the text of the <p> paragraph tags,
             * I would use this:
             */
            // I am not sure whether this is correct
            //***************************************************/
            Elements paragraphs = htmlDocument.getElementsByTag("p");
            for (Element src : paragraphs) {
                // text() returns the element's text; attr("abs:p") would look up an attribute named "p"
                System.out.println("text: " + src.text());
            }
            //***************************************************//

            /* But I do not want to look up any elements to gather the dates on the page.
             * I just want to search the whole text of the document for dates.
             * So I need a date regex that is passed as input to a search method;
             * this search should run over the text of the page, since the parsed document is in 'htmlDocument'.
             */

            // At the end we will use only one date from the search result and format it in a standard form

            /*
             * That is it.
             */


            /*
             * I was trying something like this
             */
            //final Elements elements = htmlDocument.getElementsMatchingOwnText("\\d{4}-\\d{2}-\\d{2}");
            Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Elements elements = htmlDocument.getElementsMatchingOwnText(p);

            for (Element e : elements) {
                System.out.println("element = [" + e + "]");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Fahim Uddin
  • The Document extends Element (https://jsoup.org/apidocs/org/jsoup/nodes/Element.html), so yes, it is already parsed. For your use case it might be the simplest approach to simply grab the text content with .text() (https://jsoup.org/apidocs/org/jsoup/nodes/Element.html#text--). Please provide an example with selector for the target element, observed content and intended output (http://stackoverflow.com/help/mcve). – Frederic Klein Jan 31 '17 at 08:53
  • @FredericKlein Can you check now? Bro, Thanks in Advance ! – Fahim Uddin Jan 31 '17 at 13:37
  • @RayD'vard just to clarify: for example, with the Stack Overflow page your result would be empty, because there are no dates on this page, right? – cralfaro Jan 31 '17 at 14:09
  • @cralfaro if you go to that page, you can see the comments' dates and times and the edit date and time, right? After executing my example code I get no results. Or is my regex validation with multiple date formats not correct? – Fahim Uddin Jan 31 '17 at 14:14
  • @RayD'vard give me a second, I was checking the current page – cralfaro Jan 31 '17 at 14:16
  • @RayD'vard do you want to find date values for any page or for that specific page? – cralfaro Jan 31 '17 at 14:23
  • @cralfaro for any page bro. I know there are various types of date formats used in web pages; I am not sure what to do :( – Fahim Uddin Jan 31 '17 at 14:26
  • @RayD'vard I am having the same problem: no regexp finds any elements, even in the simpler cases... we are missing something – cralfaro Jan 31 '17 at 14:48
  • @cralfaro well bro, I am exploring date formats and the corresponding regexes from here [link](http://regexlib.com/DisplayPatterns.aspx?cattabindex=4&categoryid=5&p=1); yeah, we are lost in the regex river lol .. – Fahim Uddin Jan 31 '17 at 14:54
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/134500/discussion-between-ray-dvard-and-cralfaro). – Fahim Uddin Jan 31 '17 at 15:30

1 Answer


Here is one possible solution I found:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Created by ruben.alfarodiaz on 21/12/2016.
 */
@RunWith(JUnit4.class)
public class StackTest {

    @Test
    public void findDates() {
        final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        try {
            String url = "http://stackoverflow.com/questions/51224/regular-expression-to-match-valid-dates";
            Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
            Document htmlDocument = connection.get();
            // This pattern finds all dates in dd/mm/yyyy form; to cover other formats we would need to create N more patterns
            Pattern pattern = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

            // Find all document elements whose text, including that of their descendants, matches the pattern
            Elements elements = htmlDocument.getElementsMatchingText(pattern);
            // Filter the matched elements down to the leaf elements only
            List<Element> finalElements = elements.stream().filter(elem -> isLastElem(elem, pattern)).collect(Collectors.toList());
            finalElements.forEach(elem ->
                System.out.println("Node: " + elem.html())
            );

        }catch(Exception ex){
            ex.printStackTrace();
        }
    }

    // Decides whether the element is a leaf, i.e. no other element inside it also matches the pattern
    private boolean isLastElem(Element elem, Pattern pattern) {
        return elem.getElementsMatchingText(pattern).size() <= 1;
    }

}

The point is to add as many patterns as needed, because I think it would be complex to find a single pattern that matches all possibilities.

Edit: The most important point is that the library gives you a hierarchy of elements, so you need to iterate over them to find the final leaf. For instance:
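As a rough illustration of the "one pattern per format" idea, each regex can be paired with a java.time formatter so that a match is also converted to a standard form. This is only a sketch under assumptions: the class name MultiFormatDateFinder, the choice of two formats, and the ISO yyyy-MM-dd output are hypothetical, and it works on plain text (for example the text of a leaf element) rather than on jsoup elements.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one regex per supported format, each paired with a parser.
final class MultiFormatDateFinder {

    // LinkedHashMap keeps insertion order, so patterns are tried in the order listed here.
    private static final Map<Pattern, DateTimeFormatter> FORMATS = new LinkedHashMap<>();
    static {
        // dd/mm/yyyy, the pattern used in the code above
        FORMATS.put(Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)"),
                DateTimeFormatter.ofPattern("d/M/yyyy"));
        // yyyy-mm-dd, the pattern from the question
        FORMATS.put(Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d"),
                DateTimeFormatter.ISO_LOCAL_DATE);
    }

    // Tries each pattern in order and returns the first match that parses, as yyyy-MM-dd.
    static Optional<String> firstDate(String text) {
        for (Map.Entry<Pattern, DateTimeFormatter> entry : FORMATS.entrySet()) {
            Matcher m = entry.getKey().matcher(text);
            while (m.find()) {
                try {
                    LocalDate date = LocalDate.parse(m.group(), entry.getValue());
                    return Optional.of(date.toString()); // LocalDate.toString() is ISO-8601
                } catch (DateTimeParseException ignored) {
                    // The regex matched text that is not a real date (e.g. 2017-13-01); keep looking.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstDate("posted on 20/11/2017 at noon")); // prints Optional[2017-11-20]
    }
}
```

Because the patterns are tried in list order, the result is the first match of the first pattern that succeeds, which is not necessarily the date that appears earliest in the text.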

<html>
    <body>
        <div>
           20/11/2017    
        </div>
    </body>
</html>

If we search for the pattern dd/mm/yyyy, the library will return 3 elements (html, body and div), but we are only interested in the div.

cralfaro
  • Bro, can you update it with inline comments on the Pattern and elements parts and the extra imports it depends on? Thanks! – Fahim Uddin Jan 31 '17 at 15:48
  • @RayD'vard hope it helps you! If it's a good starting point for you, could you validate the response? Thanks! – cralfaro Jan 31 '17 at 16:18
  • Okay, I see how cleverly you are finding the leaf; that is pretty neat. Thanks for that direction bro, I can use that. Now I have to find the right pattern for my regex (well, I have made a regex which is now suitable for my needs) and then extract only one date from the whole page into a string or Java date format, which I will format in a standard form. Thanks bro! – Fahim Uddin Jan 31 '17 at 16:35