Handling newlines in regex

Question

I am new to regular expressions and am trying to solve the following task.

I have a string in my blog as given below.

"I am studying Artificial
 Intelligence"

Note: "Intelligence" is on the next line.

For example the data looks like:

b_or_i = '<div class="MsoNormal"">\n<span style="font-family:">Machine learning however, is a sub-field of Artificial\nIntelligence, probably the biggest field under it.</span></div>'

I wrote an expression to remove all data enclosed in <> as below:

refine = [check for check in re.split(r"\s*<[^<]*>\s*", b_or_i, flags=re.DOTALL) if check]

After running the above code, my output looks like this:

['Machine learning however, is a sub-field of Artificial\nIntelligence, probably the biggest field under it.']
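
A general point about `re.split` that matters on longer inputs: its third positional parameter is `maxsplit`, not `flags`, so a flag passed positionally is read as a split limit (`re.DOTALL` has the integer value 16). The sketch below illustrates the difference on a made-up string with more than 16 tags; the `sample` data and variable names are assumptions for illustration only.

```python
import re

pattern = r"\s*<[^<]*>\s*"

# Hypothetical input: 20 tagged words, so far more than 16 split points.
sample = "".join("<b>word%d</b>" % i for i in range(20))

# Passed positionally, re.DOTALL lands in the maxsplit slot (value 16),
# so splitting stops early and the tail of the string keeps its tags.
limited = [part for part in re.split(pattern, sample, re.DOTALL) if part]

# Passed by keyword, it is applied as a flag and there is no split limit.
full = [part for part in re.split(pattern, sample, flags=re.DOTALL) if part]

print(len(limited), len(full))
print(limited[-1][:30])  # the unsplit remainder still contains markup
```

Since the pattern contains no `.`, `re.DOTALL` has no effect on what it matches, so dropping the flag entirely would behave the same as the keyword form.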

Next, I am trying to collect sequences of words that belong to the same entity or group. For example, "Artificial Intelligence" is a single entity, so I need both words together. However, the "\n" between them is making this difficult.

The expression I wrote to keep a single entity together is:

find_entities=re.findall(r'\b[A-Z]\B\w*(?:\s+\b[A-Z]\B\w*)*', words, re.DOTALL)

The above code works well for phrases like "Unstructured Data Set" or "Artificial Intelligence", but it fails on "Artificial\nIntelligence".

One solution I thought of is to replace "\n" with a space, but I don't know how that would affect the rest of my document.
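
A minimal sketch of that idea, assuming line breaks carry no meaning for entity detection: collapse every run of whitespace into a single space on a working copy of the text, then run the entity pattern on that copy. The original document is not modified, and the variable names here are illustrative.

```python
import re

text = ("Machine learning however, is a sub-field of Artificial\n"
        "Intelligence, probably the biggest field under it.")

# Collapse each run of whitespace (spaces, tabs, \r, \n) into one space.
# This produces a new string; `text` itself is left unchanged.
normalized = re.sub(r"\s+", " ", text).strip()

# Same entity pattern as in the question.
entity_pattern = r"\b[A-Z]\B\w*(?:\s+\b[A-Z]\B\w*)*"
entities = re.findall(entity_pattern, normalized)

print(entities)  # ['Machine', 'Artificial Intelligence']
```

One caveat with this approach: capitalized words that were separated only by a line or paragraph break, such as a heading followed by the first word of a sentence, would also be joined into one match.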

Any help would be appreciated, thanks :)

When I run your regex on your example text, it finds `Artificial\nIntelligence`... are you sure you have a problem? — Lee, Jul 29 '15 at 15:35
Related to @PadraicCunningham's comment, see perhaps the most famous answer on StackOverflow: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Two-Bit Alchemist, Jul 29 '15 at 15:44
Hi Lee, when I copy the small phrase containing "Artificial\nIntelligence" I am able to see the output, however when I run it on the whole text file I am only able to see "Artificial" as an output. — Sam, Jul 29 '15 at 17:30
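
Because the pattern matches on the copied phrase but not on the whole file, it may help to look at the exact characters that separate the two words in the file. The sketch below is only a debugging aid: `context_repr` is a hypothetical helper, and the sample string assumes one possible cause (a literal backslash followed by "n" instead of a real newline), which may not be what the actual file contains.

```python
def context_repr(text, word, width=15):
    """Return repr() of the text around each occurrence of `word`."""
    snippets = []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        # repr() makes invisible or escaped characters explicit.
        snippets.append(repr(text[max(0, start - width):end + width]))
        start = text.find(word, end)
    return snippets

# Hypothetical sample: the separator is the two characters "\" and "n",
# not a newline, so `\s+` in the entity pattern would not match it.
sample = "sub-field of Artificial\\nIntelligence, probably"

for snippet in context_repr(sample, "Artificial"):
    print(snippet)
```

In the printed `repr`, a real newline appears as `\n`, while a literal backslash followed by "n" appears as `\\n`; other separators such as `\r\n` or `\xa0` would show up the same way.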
