I'm getting information from a website that has no API or anything. I've got the login and HTML retrieval working, and I've got a system that finds the right <div> containing the information I need. But I need to strip everything out of this string that isn't in the format "DD/MM/YYYY". Here's an example of the returned <div>:

<div id="wkDrop">
    <div  name="weekstarts" id="2018_29">Week 29-16/07/2018</div>
    <div style="display:none" name="weekstarts" id="2018_30">Week 30-23/07/2018</div>
</div>

The parts that will change each week are the id="YYYY_WW" and Week WW-DD/MM/YYYY. So from the above example, I'm after two dates: 16/07/2018 and 23/07/2018.

Please bear in mind that there could be between 1 and 4 dates within this <div> so it won't always be two weeks that I need to extract.

I would also ideally have each date retrieved printed on a new line.

Any ideas how I'd go about this?

Thanks in advance for any replies :)

Andrej Kesely
  • Since all dates are in the same format, you can use regular expressions – c2huc2hu Jul 18 '18 at 14:29
  • I would **highly recommend** you **not** use a regular expression. Parse the HTML first using `bs4` (or similar) and then extract the dates with simple string manipulation, since they all seem to have a single format. – Reut Sharabani Jul 18 '18 at 14:33

3 Answers

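I'd say first you should look into using BeautifulSoup to strip the div tags and extract the text. Then you could pull the date out of each string with a regular expression for that format (ref):

(0?[1-9]|[12][0-9]|3[01])[\/\-](0?[1-9]|1[012])[\/\-]\d{4}

This pattern is often written with ^ and $ anchors, but those only match when the whole string is a date, so they have to be dropped for text like "Week 29-16/07/2018". Search with it (for example re.search or re.finditer) rather than split: splitting on the pattern returns the text between the dates, not the dates themselves.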
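The parse-first approach can be sketched roughly as below. This is only a sketch: it swaps in the standard library's html.parser for BeautifulSoup so that it runs without third-party packages, the names WeekStartParser and extract_dates are hypothetical, and it assumes the dates always sit in elements with name="weekstarts", as in the question's sample.

```python
import re
from html.parser import HTMLParser

# Unanchored date pattern, so it can match inside "Week 29-16/07/2018".
DATE_RE = re.compile(r"(?:0?[1-9]|[12][0-9]|3[01])[/-](?:0?[1-9]|1[012])[/-]\d{4}")


class WeekStartParser(HTMLParser):
    """Collects the text of every element with name="weekstarts" (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if dict(attrs).get("name") == "weekstarts":
            self._inside = True

    def handle_endtag(self, tag):
        self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.texts.append(data)


def extract_dates(html):
    """Return the first date found in each weekstarts element, in document order."""
    parser = WeekStartParser()
    parser.feed(html)
    parser.close()
    dates = []
    for text in parser.texts:
        match = DATE_RE.search(text)
        if match:
            dates.append(match.group(0))
    return dates


sample = """
<div id="wkDrop">
    <div  name="weekstarts" id="2018_29">Week 29-16/07/2018</div>
    <div style="display:none" name="weekstarts" id="2018_30">Week 30-23/07/2018</div>
</div>"""

# One date per line, as the question asks
print("\n".join(extract_dates(sample)))
```

Restricting the search to the weekstarts elements means that date-like text elsewhere on the page is ignored, which is the main advantage over running the pattern on the raw HTML.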

K. Dackow

You can use a regular expression (Python's re module; the documentation is here) to retrieve the dates. An explanation of this regular expression can be found here.

data = """
<div id="wkDrop">
    <div  name="weekstarts" id="2018_29">Week 29-16/07/2018</div>
    <div style="display:none" name="weekstarts" id="2018_30">Week 30-23/07/2018</div>
</div>"""

import re

for dates in re.findall(r'\d{2}/\d{2}/\d{4}', data):
    print(dates)

Prints:

16/07/2018
23/07/2018
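
The pattern \d{2}/\d{2}/\d{4} checks only the shape of the text, so it would also match impossible dates such as 99/99/2018. If that matters, each match can be checked with datetime.strptime. A minimal sketch, assuming day-first dates as in the question; valid_dates is a hypothetical helper name:

```python
import re
from datetime import datetime


def valid_dates(text):
    """Return the DD/MM/YYYY substrings of text that are real calendar dates."""
    found = []
    for candidate in re.findall(r"\d{2}/\d{2}/\d{4}", text):
        try:
            datetime.strptime(candidate, "%d/%m/%Y")
        except ValueError:
            continue  # right shape but not a real date, e.g. 31/02/2018
        found.append(candidate)
    return found


print(valid_dates("Week 29-16/07/2018, bogus 31/02/2018, Week 30-23/07/2018"))
```
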
Andrej Kesely

How about the re module:

import re

str1 = '<div id="wkDrop"><div  name="weekstarts" id="2018_29">Week 29-16/07/2018</div><div style="display:none" name="weekstarts" id="2018_30">Week 30-23/07/2018</div></div>'

# raw string so that \d is passed to the regex engine unchanged
match = re.findall(r'\d+/\d+/\d+', str1)
print(match)

Output:

['16/07/2018', '23/07/2018']
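
Since the question asks for each date on its own line, the list returned by re.findall can be joined with newlines. A small sketch with the same pattern; the variable names are illustrative only:

```python
import re

str1 = '<div name="weekstarts" id="2018_29">Week 29-16/07/2018</div><div name="weekstarts" id="2018_30">Week 30-23/07/2018</div>'

# findall returns a list of matches; join them so each date lands on its own line
output = "\n".join(re.findall(r"\d+/\d+/\d+", str1))
print(output)
```
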
Chetan Ameta