I'm learning regex in Python (3.6), and my pattern isn't returning the matches I expect. Below is the code that fetches the string values I'm running the regex against:

import json
import os
import pandas as pd
import requests
import re

url = 'http://www.trumba.com/calendars/brisbane-city-council.json'
uh = requests.get(url)

json_data = json.loads(uh.text)
json_str = json.dumps(json_data)
panda_json = pd.read_json(json_str, typ = 'frame')

Now I want to match the HTML hyperlink in the 'location' column.

With the regex, I'm expecting matches like the one below (anything between < and >):

<a href="http://maps.google.com/?q=33+Teevan+St%2c+Stafford+QLD+4053%2c+Australia" target="_blank">

so I'm using the regex below:

pattern = re.compile(r'/[<].*?[>]/')

and then I try to store the matches in a DataFrame:

matches = re.findall(pattern, str(panda_json['location']))

x = []
for match in matches:
    x.append(match)

x = pd.DataFrame(x)

But 'x' is empty. I'm sure I'm missing something obvious.

halfer
mlinardy
  • You are probably looking for `panda_json['location'].str.extract(r'<([^>]+)>')` – Wiktor Stribiżew Feb 08 '19 at 09:20
  • Note that you should really [avoid parsing HTML with regex](https://stackoverflow.com/a/1732454/1678362) and that Python has the delightful [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library that handles parsing and extracting data from HTML – Aaron Feb 08 '19 at 09:21
  • Thanks both! @WiktorStribiżew it works. I will find out the full definition of the regex. thanks for your help. – mlinardy Feb 09 '19 at 03:59
  • @Aaron thanks also for the advice and for pointing to that post. I will dig deeper why we should avoid parsing HTML with regex (although I'm a bit confused.. the solution from Wiktor seems to work).. – mlinardy Feb 09 '19 at 04:00
  • The issue mostly is that HTML is a very permissive language and trying to account for all pitfalls in a regex isn't worth it especially when other tools already do it. Even in your simple case there could (technically, although unlikely) be a `<![CDATA` block or ` – Aaron Feb 09 '19 at 11:18

1 Answer

You may simply extract the substrings between < and > using

panda_json['location'].str.extract(r'<([^>]+)>')

In the <([^>]+)> pattern, the leading < matches a literal <, [^>]+ matches one or more characters other than >, and the trailing > matches a literal >. Because [^>]+ is enclosed in ( and ), the text it matches is captured in Group 1, and .str.extract returns only that captured value.
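
To see the difference between the patterns outside pandas, here is a minimal sketch using only the standard `re` module. The `sample` string is a hypothetical value modelled on the link in the question (the visible link text is an assumption, not taken from the real feed). It also suggests why the original pattern finds nothing: Python's `re` has no `/.../` delimiters, so the slashes are treated as literal characters that must appear in the text.

```python
import re

# Hypothetical sample value, modelled on the link shown in the question.
sample = (
    '<a href="http://maps.google.com/?q=33+Teevan+St%2c+Stafford+QLD+4053%2c+Australia" '
    'target="_blank">33 Teevan St, Stafford QLD 4053, Australia</a>'
)

# Pattern from the question: the slashes are literal characters here,
# so a match would have to start with "/<" and end with ">/".
with_slashes = re.findall(r'/[<].*?[>]/', sample)

# Without a capture group, findall returns each whole tag, brackets included.
whole_tags = re.findall(r'<[^>]+>', sample)

# With the capture group from the answer, only the text inside < > is returned.
inside_tags = re.findall(r'<([^>]+)>', sample)

print(with_slashes)
print(whole_tags[0])
print(inside_tags[0])
```

Note that `.str.extract` returns only the first match in each row; `.str.extractall` is the pandas method to reach for if every tag in a cell is needed.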

Wiktor Stribiżew