1

I have an HTML string retrieved from the Discourse API containing a few elements (p, span, div, etc.), some of which have attributes such as data-time, data-timezone and data-email-preview. I want the values of the data-email-preview attributes; these are timestamps such as 2019-05-10T17:00:00Z. The same values always appear as the text of the first two span elements in the HTML string. Example HTML string:

<p><span data-date="2019-05-10" data-time="19:00:00" class="discourse-local-date" data-timezones="Europe/Brussels" data-timezone="Europe/Berlin" data-email-preview="2019-05-10T17:00:00Z UTC">2019-05-10T17:00:00Z</span> → <span data-date="2019-05-10" data-time="22:00:00" class="discourse-local-date" data-timezones="Europe/Brussels" data-timezone="Europe/Berlin" data-email-preview="2019-05-10T20:00:00Z UTC">2019-05-10T20:00:00Z</span><br>
<div class="lightbox-wrapper"><div class="meta">
<span class="filename">HackSpace_by_Sugar_Ray_Banister.jpg</span><span class="informations">1596×771 993 KB</span><span class="expand"></span>
</div></a></div></p>

I need these two dates between span elements extracted:

2019-05-10T17:00:00Z and 2019-05-10T20:00:00Z

4 Answers

1

(?<=>)(\d{4}\-\d{2}\-\d{2}T\d{2}\:\d{2}\:\d{2}Z)(?=<\/span>)

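This returns the elements you need. The lookbehind requires a `>` directly before the timestamp and the lookahead requires `</span>` directly after it, so only the span text is matched, not the attribute values.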
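As a quick check, the pattern above can be run against the example from the question. This is only a sketch in Python (the question does not name a language); the variable names `html`, `pattern` and `dates` are made up for illustration, and the HTML is shortened to the two relevant spans.

```python
import re

# Example HTML from the question, shortened to the two date spans.
html = (
    '<p><span data-date="2019-05-10" data-time="19:00:00" '
    'class="discourse-local-date" data-timezone="Europe/Berlin" '
    'data-email-preview="2019-05-10T17:00:00Z UTC">2019-05-10T17:00:00Z</span> → '
    '<span data-date="2019-05-10" data-time="22:00:00" '
    'class="discourse-local-date" data-timezone="Europe/Berlin" '
    'data-email-preview="2019-05-10T20:00:00Z UTC">2019-05-10T20:00:00Z</span><br>'
)

# Same pattern as in the answer; "-", ":" and "/" need no escaping in Python.
pattern = re.compile(r"(?<=>)(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)(?=</span>)")

# findall returns the captured group of every match, in document order.
dates = pattern.findall(html)
print(dates)
```

The attribute values are skipped because they are preceded by a quote rather than `>` and followed by ` UTC` rather than `</span>`.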

AleksW
  • 703
  • 3
  • 12
0

Maybe this would meet your needs?

https://regex101.com/r/Jo4srA/1

(slightly edited to meet your needs)

Carl Verret
  • 576
  • 8
  • 21
  • A couple of problems with this: it also returns dates from the `data-email-preview` attribute, and it does not include the `Z` – AleksW Apr 17 '19 at 15:49
  • Adding the Z character is a matter of seconds. Could you please give more detail about what is supposed to be ignored and what is to be captured? – Carl Verret Apr 17 '19 at 15:58
0

You can do this with the PHP Simple HTML DOM Parser library. It is on GitHub, but I download it from SourceForge: https://simplehtmldom.sourceforge.io

Use it as follows

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images 
foreach($html->find('img') as $element) 
echo $element->src . '<br>';

// Find all links 
foreach($html->find('a') as $element) 
echo $element->href . '<br>';

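In your case you already have the HTML as a string, so load it with str_get_html() instead of file_get_html(), then select the span elements by attribute:

// 'span.data-email-preview' would match a class name; use an attribute selector instead
$spans = $html->find('span[data-email-preview]');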

If you want to use a regex function instead (preg_match_all rather than preg_replace, since you are extracting values), that is easy too, but the result can be confusing because the string contains many date-like values. You get several matches back, so collect them into an array and loop over it to handle each date on its own, for example before importing them into a database.
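The parser-based idea from this answer can also be sketched without PHP. The following is a minimal Python version using only the standard library's `html.parser`; it is not the Simple HTML DOM API, and the class name `PreviewCollector` and the variable names are hypothetical. It collects every `data-email-preview` value from `span` tags into a list, which can then be looped over one date at a time.

```python
from html.parser import HTMLParser


class PreviewCollector(HTMLParser):
    """Collects data-email-preview attribute values from <span> start tags."""

    def __init__(self):
        super().__init__()
        self.previews = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        # attrs is a list of (name, value) pairs for this tag.
        value = dict(attrs).get("data-email-preview")
        if value is not None:
            self.previews.append(value)


# Shortened example: two date spans plus one span without the attribute.
html = (
    '<p><span class="discourse-local-date" '
    'data-email-preview="2019-05-10T17:00:00Z UTC">2019-05-10T17:00:00Z</span> '
    '<span class="discourse-local-date" '
    'data-email-preview="2019-05-10T20:00:00Z UTC">2019-05-10T20:00:00Z</span><br>'
    '<span class="filename">HackSpace_by_Sugar_Ray_Banister.jpg</span></p>'
)

collector = PreviewCollector()
collector.feed(html)

# Each attribute value looks like "2019-05-10T17:00:00Z UTC";
# keep only the timestamp in front of the space.
timestamps = [v.split()[0] for v in collector.previews if v.strip()]

for ts in timestamps:
    print(ts)
```

Reading the attribute through a parser avoids the ambiguity of a regex, since the same timestamp appears both in the attribute and in the span text.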

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Creative87
  • 125
  • 9
0

In VBA, something like this:

Sub Extract2()
    ' Needs references to "Microsoft HTML Object Library" and
    ' "Microsoft VBScript Regular Expressions 5.5"
    Dim hDoc As MSHTML.HTMLDocument
    Dim dateBody As Object
    Dim sFile As String, lFile As Long
    Dim pat1 As String
    Dim sHtml As String
    Dim Date1 As String, Date2 As String

    sFile = "c:\1.html"
    'read in the file
    lFile = FreeFile
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)
    Close lFile

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml

    'option 1: DOM lookup by class name
    Set dateBody = hDoc.getElementsByClassName("discourse-local-date")
    Date1 = dateBody(0).innerText
    Date2 = dateBody(1).innerText
    MsgBox Date1 & " " & Date2

    'option 2: regex on the raw string; [^<]+ keeps each match inside one span
    pat1 = "<span[^>]*>([^<]+)</span>"
    Date1 = simpleRegex(sHtml, pat1, 0)
    Date2 = simpleRegex(sHtml, pat1, 1)
    MsgBox Date1 & " " & Date2
End Sub

Helper function for the regex:

Function simpleRegex(strInput As String, strPattern As String, sNr As Long) As String
    Dim regEx As New RegExp
    Dim sReg As MatchCollection

    'returned when there is no pattern or no match at index sNr
    simpleRegex = "false"
    If strPattern = "" Then Exit Function

    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = strPattern
    End With

    'sNr is the zero-based index of the match whose first group is returned
    Set sReg = regEx.Execute(strInput)
    If sNr < sReg.Count Then simpleRegex = sReg(sNr).SubMatches(0)
End Function
Dmitrij Holkin
  • 1,995
  • 3
  • 39
  • 86