
I am working on extracting the names of people from various ads that appear in English newspapers.

However, I have noticed that I need to identify the boundary of each ad before extracting the names in it, since I only need the first name that occurs in each ad. I started with Stanford NLP and was able to extract names, but I got stuck on identifying the paragraph boundaries.

Is there any way to identify paragraph boundaries?

MWiesner
kiran
  • Your question is a bit vague. Are you looking for structural clues? Linguistic clues? Please post an excerpt from your corpus. – Pierre Nov 21 '13 at 12:59
  • Here is a small sample: – kiran Nov 25 '13 at 04:31
  • OBITUARY. GENERAL WILLIAM H. BROWNELL. Brigadier Gonoral William H. BrowncU dlod yes- terday afternoon at his borne, No. 258 Ponn-st., Brook¬ lyn, after an illness of sovoral days. Tho cause of his doath was pneumonia. Ho held the position of Assistant Chiof of Ordnanc© In tho Ordnance De¬ partment of tho Stato at tho tlmo of his death, rank- Ing as Colonel. GEORGE TICKNOR CURTIS, JR. George Tlcknor Curtis, jr., son of Georgo Ttcknor Curtis, tho woll-known author, and grandson of Justice Story, died yesterday in Philadelphia. Ho had beon in poor health for more than a year. – kiran Nov 25 '13 at 04:31
  • Your corpus seems pretty noisy. I guess it is the output of an OCR system. What about a regex to match substrings of uppercase characters? You could also use a dictionary of proper nouns to filter out unwanted cases. – Pierre Nov 25 '13 at 13:45
  • I have already tried matching uppercase characters. It doesn't work that well. I am looking for any other options available. – kiran Nov 26 '13 at 11:10
  • Well, a combination of different features might be the solution, e.g., case of characters, matching with a proper noun from a dictionary, length of paragraph-candidates, and so on. You can build rules based on these features. Eventually you will end up with a machine learning approach using more features and some training data (hand-split paragraphs). If you're not familiar with ML give Weka a try (it's easy to use). – Pierre Nov 26 '13 at 16:12
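
To make the rule-based idea from the last comment concrete, here is a minimal sketch that combines two of the suggested features (character case and the length of the candidate) to split the OCR text at heading-like runs of uppercase tokens. It is only an illustration under stated assumptions: the function names (`is_upper_token`, `closes_heading`, `find_headings`, `split_entries`) and the thresholds `min_tokens` and `max_tokens` are hypothetical, and the sample string is an abbreviated version of the excerpt posted above. It assumes each entry opens with an all-uppercase name ending in a period.

```python
# Abbreviated from the excerpt posted in the comments (OCR noise kept on purpose).
SAMPLE = (
    "OBITUARY. GENERAL WILLIAM H. BROWNELL. Brigadier Gonoral William H. "
    "BrowncU dlod yes- terday afternoon at his borne. "
    "GEORGE TICKNOR CURTIS, JR. George Tlcknor Curtis, jr., son of Georgo "
    "Ttcknor Curtis, died yesterday in Philadelphia."
)


def is_upper_token(tok):
    """Feature 1: every letter in the token is uppercase (punctuation ignored)."""
    letters = [c for c in tok if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def closes_heading(tok):
    """A token ends a heading if it ends with '.' and is not a single-letter initial such as 'H.'."""
    return tok.endswith(".") and sum(c.isalpha() for c in tok) > 1


def find_headings(tokens, min_tokens=2, max_tokens=6):
    """Return (start, end) token spans of uppercase runs that look like name headings.

    Feature 2: the run must contain between min_tokens and max_tokens tokens,
    which drops one-word section labels such as 'OBITUARY.'.
    """
    spans = []
    start = None
    for i, tok in enumerate(tokens):
        if not is_upper_token(tok):
            start = None  # the uppercase run is broken
            continue
        if start is None:
            start = i
        if closes_heading(tok):
            if min_tokens <= i + 1 - start <= max_tokens:
                spans.append((start, i + 1))
            start = None
    return spans


def split_entries(text):
    """Split text into (heading, body) pairs, one per detected heading."""
    tokens = text.split()
    spans = find_headings(tokens)
    entries = []
    for n, (start, end) in enumerate(spans):
        body_end = spans[n + 1][0] if n + 1 < len(spans) else len(tokens)
        entries.append((" ".join(tokens[start:end]), " ".join(tokens[end:body_end])))
    return entries


if __name__ == "__main__":
    for heading, body in split_entries(SAMPLE):
        print(heading, "->", body[:40])
```

A rule like this will miss headings that the OCR has damaged (for example a lowercase letter inside an uppercase name), which is where the dictionary lookup and the machine learning approach mentioned above would come in.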

2 Answers


This is a difficult problem, and we are facing the same one in one of our projects. There are some theory papers that help define the scope of the problem and potential solutions in detail. I'll include them below.

We're still in the process of R&D so we haven't many answers just yet, but we are willing to share what we have and find as time moves forward.

Here is one such paper:

Automatic Paragraph Identification: A Study across Languages and Domains

Here is the GitHub link for the icsiboost code they use:

Open-source implementation of Boostexter (Adaboost based classifier)

ProfVersaggi

There is surprisingly little research on the automatic detection of paragraph boundaries. I have found the following (in addition to the paper provided by ProfVersaggi), all of which are quite old:

Sporleder and Lapata (2005): Broad coverage paragraph segmentation across languages and domains

Filippova and Strube (2006): Using Linguistically Motivated Features for Paragraph Boundary Identification

Genzel (2005): A Paragraph Boundary Detection System

martin_wun