
I am trying to parse a user's event descriptions after obtaining access to their Google Calendar through the Google Calendar API. When I read a description into my program, I want to get rid of default (and useless) text such as Zoom meeting invitations. Suppose the following is the description string:

<br>Hi, please keep this text.<br>

<br>Bob is inviting you to a scheduled Zoom meeting.<br>

<br>Topic: Bob\'s Personal Meeting Room<br>
<br>Join Zoom Meeting<br>
<a href="https://us04web.zoom.us/j/4487518794?pwd=SkdaTE9nV3E1M3FSaWplOHYvNGlndz09">https://us04web.zoom.us/j/4487518794?pwd=SkdaTE9nV3E1M3FSaWplOHYvNGlndz09</a><br>
<br>Meeting ID: 448 751 8794<br>
Password: 1F9W2P<br>

<br>Also not Zoom default text.

How can I parse it so that only "Hi, please keep this text. Also not Zoom default text." remains?

Alessandro
Danlo9

1 Answer

Methodology

I think this would be a good use for regular expressions (regex), a pattern-matching notation that lets you describe the general structure of a string. Regex is generally a poor fit for HTML and XML, since it is not designed to extract information from nested markup, but it should work if all you want to do is discard certain sections.

Explanation

If I understand correctly, you would like to be left with

<br>Hi, please keep this text.<br>

<br>Also not Zoom default text.<br>

Which means we need to come up with a pattern to match the following portion (the brackets mark the information that will change every time):

<br>[Name] is inviting you to a scheduled Zoom meeting.<br>

<br>Topic: [Name]\'s Personal Meeting Room<br>
<br>Join Zoom Meeting<br>
<a href="[Link]">[Link]</a><br>
<br>Meeting ID: [ID]<br>
Password: [Password]<br>

Important Pieces:

  • The beginning: [Name] will be some string of at least one character. To make sure the match doesn't start at <br>Hi, please keep this text.<br>, the name must not contain "<br>". A negated character class can't express this, because it excludes individual characters rather than a sequence, so use a negative lookahead instead: (?:(?!<br>).) matches any single character that does not begin a "<br>". The rest of the sentence should be matched word for word, so we're not just matching anything.

  • The end: [Password], like [Name], is just (?:(?!<br>).)+? for the same reason.

  • This string starts and ends with "<br>". This should be reflected in the regex.

  • Everything between that first sentence and the password portion has a format, but can be treated as a wildcard: some mix of at least one character or line break. In regex this is .+? with the re.DOTALL flag set, since by default . does not match line breaks.

Replacing all of the appropriate portions in the text, you get the following:

<br>(?:(?!<br>).)+? is inviting you to a scheduled Zoom meeting\..+?Password: (?:(?!<br>).)+?<br>

Code

As for the Python, the re module will come in handy here as your regex aid. We want to save the above pattern into a variable and use it to cut the appropriate portion out of the string.

To "save" the pattern, the re module allows you to compile the regex into an object (the r before the string makes it a raw string, so backslashes reach the regex engine unchanged):

import re
# re.DOTALL lets "." match line breaks too
zoom_pattern = re.compile(r"<br>(?:(?!<br>).)+? is inviting you to a scheduled Zoom meeting\..+?Password: (?:(?!<br>).)+?<br>", re.DOTALL)

The module also provides the ability to replace regex matches within strings, and we can replace our match with nothing to cut it out of the string:

import re
s = " - string with zoom meeting stuff - "
zoom_pattern = re.compile(r"<br>(?:(?!<br>).)+? is inviting you to a scheduled Zoom meeting\..+?Password: (?:(?!<br>).)+?<br>", re.DOTALL)

# Replace every match with the empty string
clean_string = zoom_pattern.sub("", s)

Since we compiled the pattern, you now have a reusable way to clean up your string!
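As a quick check, here is a minimal end-to-end sketch run against the sample description from the question. It assumes the name and password are matched with a negative lookahead, `(?:(?!<br>).)`, because a negated character class such as `[^<br>]` excludes the individual characters `<`, `b`, `r`, and `>` rather than the sequence `<br>`. It also assumes `re.DOTALL` so that `.` can cross line breaks. The names `ZOOM_PATTERN` and `strip_zoom_invite`, the shortened link, and the final tag cleanup are illustrative assumptions, not part of the original post.

```python
import re

# "(?:(?!<br>).)" matches one character that does not begin a "<br>" tag,
# so the name and password cannot reach across a <br> boundary.
ZOOM_PATTERN = re.compile(
    r"<br>(?:(?!<br>).)+? is inviting you to a scheduled Zoom meeting\."
    r".+?Password: (?:(?!<br>).)+?<br>",
    re.DOTALL,  # let "." match newlines as well
)


def strip_zoom_invite(description: str) -> str:
    """Remove the Zoom invitation block, then flatten the remaining HTML."""
    without_invite = ZOOM_PATTERN.sub("", description)
    # Replace leftover <br> tags with spaces and collapse runs of whitespace.
    text = re.sub(r"<br\s*/?>", " ", without_invite)
    return " ".join(text.split())


# Sample from the question (link shortened for readability).
description = (
    "<br>Hi, please keep this text.<br>\n\n"
    "<br>Bob is inviting you to a scheduled Zoom meeting.<br>\n\n"
    "<br>Topic: Bob's Personal Meeting Room<br>\n"
    "<br>Join Zoom Meeting<br>\n"
    '<a href="https://us04web.zoom.us/j/4487518794">'
    "https://us04web.zoom.us/j/4487518794</a><br>\n"
    "<br>Meeting ID: 448 751 8794<br>\n"
    "Password: 1F9W2P<br>\n\n"
    "<br>Also not Zoom default text."
)

print(strip_zoom_invite(description))
# prints: Hi, please keep this text. Also not Zoom default text.
```

The same function should also work when the description arrives as one line with no newlines, since the pattern only relies on the `<br>` tags.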

If you'd like to change your regex to match each individual thing, just adjust the "Important Pieces" from earlier to match your goal. If you want to test your ideas, this is a wonderful resource!
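One possible way to match each piece individually, sketched under assumptions: split the description on `<br>` and drop every segment that fully matches one of the known Zoom lines. The names `PIECES`, `PIECE_RE`, and `drop_zoom_lines` are hypothetical, and the per-line patterns are guesses based only on the sample in the question.

```python
import re

# Hypothetical per-line patterns, one for each piece of the Zoom boilerplate.
PIECES = [
    r".+ is inviting you to a scheduled Zoom meeting\.",
    r"Topic: .+",
    r"Join Zoom Meeting",
    r'<a href="https://[^"]*zoom\.us/[^"]*">[^<]*</a>',
    r"Meeting ID: [\d ]+",
    r"Password: .+",
]
PIECE_RE = re.compile("|".join(f"(?:{piece})" for piece in PIECES))


def drop_zoom_lines(description: str) -> str:
    """Keep only the <br>-separated segments that are not Zoom boilerplate."""
    segments = (seg.strip() for seg in re.split(r"<br\s*/?>", description))
    kept = [seg for seg in segments if seg and not PIECE_RE.fullmatch(seg)]
    return " ".join(kept)


# Sample from the question (link shortened for readability).
description = (
    "<br>Hi, please keep this text.<br>\n\n"
    "<br>Bob is inviting you to a scheduled Zoom meeting.<br>\n\n"
    "<br>Topic: Bob's Personal Meeting Room<br>\n"
    "<br>Join Zoom Meeting<br>\n"
    '<a href="https://us04web.zoom.us/j/4487518794">'
    "https://us04web.zoom.us/j/4487518794</a><br>\n"
    "<br>Meeting ID: 448 751 8794<br>\n"
    "Password: 1F9W2P<br>\n\n"
    "<br>Also not Zoom default text."
)

print(drop_zoom_lines(description))
```

The trade-off is that each piece is removed wherever it appears, so a user-written line that happens to start with "Password: " or "Topic: " would be dropped too. The single-block pattern is stricter in that respect.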

Marcel M
  • good answer! note that [regular expressions aren't normally recommended for parsing html/xml](https://stackoverflow.com/a/1732454/1358308) but given the info here (and how I've seen these invites used) a regex like this does very well considering the effort! note that `zoom_pattern.sub('', s)` might be easier – Sam Mason Jun 27 '20 at 22:54
  • Thanks for the feedback! I figured since OP doesn't seem to want any actual information from the HTML, this could be a good shortcut – Marcel M Jun 27 '20 at 23:18
  • Wouldn't this also get rid of the `<br>Hi, please keep this text.<br>` at the beginning because it also matches the `<br>.+` part of the regex?
    – Danlo9 Jun 28 '20 at 17:13
  • Great question! The `<br>.+` by itself would indeed match that as well. The difference is, to match that whole string, it would also need to match the line break between the lines. The `.+` only selects non-line break characters, so that portion would not match. If you'd like, you could make the name portion only match "word characters" (letters, numbers, and underscores), but that would mean you would have to account for hyphenated names, first name-last name, etc.
    – Marcel M Jun 28 '20 at 17:21
  • Ohhh I see. My actual string does not have a line break. I manually placed in the line break to make it more readable. It actually looks like this: ```<br>Hi, please keep this text.<br><br>[name] is inviting you to a scheduled Zoom meeting.<br><br>Topic: [name]\'s Personal Meeting Room<br><br>Join Zoom Meeting<br>[link]<br><br>Meeting ID: 448 751 8794<br>Password: [Password]<br><br>Also not Zoom default text.```
    – Danlo9 Jun 28 '20 at 21:09
  • I edited the answer to try and fix that issue, let me know if it works – Marcel M Jun 28 '20 at 23:57
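The point discussed in the comments can be checked with a small sketch. By default `.` does not match a newline, so a greedy `<br>.+` is stopped by the line break in a multi-line string but runs across the whole of a single-line string. The shortened test strings and the names `greedy_dot` and `tempered` are illustrative assumptions. The lookahead form is one possible fix and may not be the exact edit made to the answer.

```python
import re

# Shortened stand-ins for the two shapes of the description string.
two_lines = "<br>keep<br>\n<br>Bob is inviting you"
one_line = "<br>keep<br><br>Bob is inviting you"

# Greedy dot, default flags: "." stops at "\n".
greedy_dot = re.compile(r"<br>.+ is inviting you")

# Lookahead variant: each character must not begin a "<br>" tag.
tempered = re.compile(r"<br>(?:(?!<br>).)+? is inviting you", re.DOTALL)

# With a line break in the way, only the Zoom sentence is matched.
print(greedy_dot.search(two_lines).group())  # <br>Bob is inviting you

# Without the line break, the text to keep is swallowed as well.
print(greedy_dot.search(one_line).group())   # <br>keep<br><br>Bob is inviting you

# The lookahead variant matches only the Zoom sentence in both cases.
print(tempered.search(two_lines).group())    # <br>Bob is inviting you
print(tempered.search(one_line).group())     # <br>Bob is inviting you
```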