I am new to regular expressions and am trying to solve the following task.
I have a string in my blog like the one below.
"I am studying Artificial
Intelligence"
Note: "Intelligence" is on the next line.
For example, the data looks like this:
b_or_i = '<div class="MsoNormal"">\n<span style="font-family:">Machine learning however, is a sub-field of Artificial\nIntelligence, probably the biggest field under it.</span></div>'
I wrote the expression below to remove everything enclosed in <>:
# flags must be passed by keyword: the third positional argument of re.split is maxsplit
refine = [check for check in re.split(r"\s*<[^<]*>\s*", b_or_i, flags=re.DOTALL) if check]
After running the code above, my output looks like this:
['Machine learning however, is a sub-field of Artificial\nIntelligence, probably the biggest field under it.']
What I am trying to do next is collect combinations of words that belong to the same entity or group. For example, "Artificial Intelligence" is a single entity, so I need the two words together. However, the "\n" between them is making this difficult.
The expression I wrote to capture a multi-word entity as a single match is given below:
find_entities=re.findall(r'\b[A-Z]\B\w*(?:\s+\b[A-Z]\B\w*)*', words, re.DOTALL)
The code above does a good job of capturing phrases like "Unstructured Data Set" or "Artificial Intelligence", but it does not work well with "Artificial\nIntelligence".
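To narrow this down, it may help to check whether the separator is a real newline character or the two literal characters backslash and n, since \s only matches the former. The following is a minimal sketch, not a diagnosis of the actual data: the two sample strings and their names (real_newline, literal_backslash_n) are made up for illustration.

```python
import re

# Same pattern as in the question.
pattern = r'\b[A-Z]\B\w*(?:\s+\b[A-Z]\B\w*)*'

# Hypothetical samples: one with a real newline, one with a literal backslash + "n".
real_newline = 'sub-field of Artificial\nIntelligence, probably'
literal_backslash_n = 'sub-field of Artificial\\nIntelligence, probably'

matches_real = re.findall(pattern, real_newline)
matches_literal = re.findall(pattern, literal_backslash_n)

# repr() makes the difference between the two cases visible.
print(repr(matches_real))     # ['Artificial\nIntelligence']
print(repr(matches_literal))  # ['Artificial']
```

If print(repr(words)) shows \\n rather than \n, the text contains the literal two-character sequence, and \s+ cannot bridge it.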
One solution I thought of is replacing "\n" with a space, but I don't know how that would affect the rest of my document.
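One way to try this without touching the document itself is to collapse whitespace only inside each match. This is a minimal sketch, assuming the text contains real newline characters; the helper name find_entities and the sample string are illustrative, not part of the original code.

```python
import re

# Same pattern as in the question, compiled once for reuse.
ENTITY_RE = re.compile(r'\b[A-Z]\B\w*(?:\s+\b[A-Z]\B\w*)*')

def find_entities(text):
    """Return runs of capitalized words, with internal whitespace collapsed to one space."""
    # Only the extracted matches are normalized; `text` itself is never modified.
    return [re.sub(r'\s+', ' ', match) for match in ENTITY_RE.findall(text)]

sample = ('Machine learning however, is a sub-field of Artificial\n'
          'Intelligence, probably the biggest field under it.')

print(find_entities(sample))  # ['Machine', 'Artificial Intelligence']
```

Because the substitution runs on the matches rather than on the source string, line breaks elsewhere in the document stay exactly as they were.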
Any help would be appreciated, thanks :)