
I am new to regular expressions and am trying to solve the following task.

I have a string in my blog as given below.

"I am studying Artificial
 Intelligence"

Note: "Intelligence" is on the next line.

For example, the data looks like this:

b_or_i = '<div class="MsoNormal">\n<span style="font-family:">Machine learning however, is a sub-field of Artificial\nIntelligence, probably the biggest field under it.</span></div>'

I wrote the following expression to remove everything enclosed in <>:

refine = [check for check in re.split(r"\s*<[^<]*>\s*", b_or_i, flags=re.DOTALL) if check]

After running the above code, my output looks like this:

['Machine learning however, is a sub-field of Artificial\nIntelligence, probably the biggest field under it.']
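As an alternative to splitting on tags with a regex, the text could probably be pulled out with the standard library's HTML parser. This is only a minimal sketch: `TextCollector` and `extract_text` are hypothetical names introduced here, and the sample string is the one from above with the stray quote removed.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    # Hypothetical helper: returns the non-empty text runs of an HTML fragment.
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks


b_or_i = ('<div class="MsoNormal">\n<span style="font-family:">Machine learning '
          'however, is a sub-field of Artificial\nIntelligence, probably the '
          'biggest field under it.</span></div>')
print(extract_text(b_or_i))
```

For this sample the result should be the same single-element list as the `re.split` version, with the "\n" inside the sentence left untouched.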

Having said that, I am trying to collect combinations of words that belong to the same entity or group. For example, "Artificial Intelligence" is a single entity, so I need the two words together. However, the "\n" between them is getting in the way.

The expression I wrote to capture a single entity as one match is given below:

find_entities=re.findall(r'\b[A-Z]\B\w*(?:\s+\b[A-Z]\B\w*)*', words, re.DOTALL)

The above code does a great job of capturing phrases like "Unstructured Data Set" or "Artificial Intelligence", but it does not work well with "Artificial\nIntelligence".

One solution I thought of is replacing "\n" with a space, but I don't know how that would affect the rest of my document.
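A possible way around that worry is to leave the document alone and collapse whitespace only inside each match. The sketch below reuses the pattern from the question; `ENTITY_RE` and `find_entities` are hypothetical names, and it assumes the tags have already been stripped from `words`.

```python
import re

# Same pattern as in the question. \s+ already matches "\n", "\r\n" and tabs,
# so no flag is needed for the match to cross a line break.
ENTITY_RE = re.compile(r'\b[A-Z]\B\w*(?:\s+\b[A-Z]\B\w*)*')


def find_entities(text):
    # Collapse whitespace inside each match only; the source text is not modified.
    return [re.sub(r'\s+', ' ', match) for match in ENTITY_RE.findall(text)]


words = ('Machine learning however, is a sub-field of Artificial\n'
         'Intelligence, probably the biggest field under it.')
print(find_entities(words))  # ['Machine', 'Artificial Intelligence']
```

Because the replacement happens on the extracted matches, the line breaks in the original text stay exactly where they were.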

Any help would be appreciated, thanks :)

msturdy
Sam
  • Use an actual HTML parser – Padraic Cunningham Jul 29 '15 at 15:31
  • When I run your regex on your example text, it finds `Artificial\nIntelligence`... are you sure you have a problem? – Lee Jul 29 '15 at 15:35
  • don't replace with space... s/\\n/\n/g – Wolfger Jul 29 '15 at 15:36
  • To start off, `[A-Z]\B\w*` is equivalent to `[A-Z]\w+` –  Jul 29 '15 at 15:40
  • Related to @PadraicCunningham's comment, see perhaps the most famous answer on StackOverflow: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Two-Bit Alchemist Jul 29 '15 at 15:44
  • Hi Lee, when I copy the small phrase containing "Artificial\nIntelligence" I am able to see the output, however when I run it on the whole text file I am only able to see "Artificial" as an output. – Sam Jul 29 '15 at 17:30
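One way to look into the difference between the small phrase and the whole file might be to print exactly what sits between the two words in the full text. This is a rough diagnostic sketch, not a fix: `show_gap` is a hypothetical helper, and the three sample strings are made-up illustrations of things that could be hiding there (a plain newline, a leftover tag, an HTML entity).

```python
import re


def show_gap(text, left='Artificial', right='Intelligence', limit=40):
    # Hypothetical diagnostic: repr() of whatever lies between the two words,
    # looking at most `limit` characters ahead (line breaks included).
    pattern = re.escape(left) + r'(.{0,%d}?)' % limit + re.escape(right)
    return [repr(m.group(1)) for m in re.finditer(pattern, text, flags=re.DOTALL)]


print(show_gap('Artificial\nIntelligence'))       # only a newline
print(show_gap('Artificial<br>\nIntelligence'))   # a leftover tag
print(show_gap('Artificial&nbsp;Intelligence'))   # an HTML entity
```

If the gap in the real file contains anything other than whitespace, the `\s+` in the entity pattern would stop at "Artificial", which would match the behaviour described.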

0 Answers