-2

I have a project in which I need to extract quotations from a huge set of articles. By quotations I mean things said by people, e.g.: Alen said "text to be extracted." I'm using NLTK for my other NLP-related tasks, so any solution using NLTK or any other Python library would be quite useful.

Thanks

Ic3fr0g

2 Answers

2

As Mayur mentioned, you can use a regex to pick up everything between quotes:

import re
quotes = re.findall(r'".*?"', text)

The problem you'll run into is that a surprisingly large number of things between quotation marks are not actually quotations.
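One rough way to cut down on that noise is to discard very short matches, since a single quoted letter or word is rarely reported speech. This is only a sketch: the helper name `likely_quotations` and the `min_words` cutoff of 3 are assumptions made for illustration, and the threshold would need tuning on real articles.

```python
import re

def likely_quotations(text, min_words=3):
    """Return quoted spans that contain at least `min_words` words.

    `min_words` is an assumed heuristic cutoff, not something provided
    by `re` or NLTK.
    """
    # Capture the text between each pair of straight double quotes.
    candidates = re.findall(r'"(.*?)"', text)
    # Keep only spans long enough to plausibly be something a person said.
    return [c.strip() for c in candidates if len(c.split()) >= min_words]

sample = 'The letter "A" is first. Alen said "text to be extracted from here."'
print(likely_quotations(sample))
```

Here the quoted letter is dropped and only the longer span survives. A length filter will still let through long non-speech phrases, so it complements rather than replaces the other checks.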

If you're working with academic articles, you can look for a number after the closing quotation mark to pick up the footnote number. For non-academic articles, you could instead require a reporting verb in front of the quote, with something like:

"(said|writes|argues|concludes)(,)? \".*?\""

This is more precise, but it risks missing quotes that are not introduced this way, such as blockquotes. Blockquotes will cause you problems anyway, because they can include a newline before the closing quotation mark, and . does not match a newline by default.
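Both heuristics can be sketched with the standard `re` module. The names `ATTRIBUTED` and `FOOTNOTED` and the sample strings are made up for illustration, and the verb list is just the four verbs mentioned above. Using `[^"]*` instead of `.` is one way to let a quote run across a newline.

```python
import re

# Reporting verb, optional comma, whitespace, then a quoted span.
# [^"]* matches anything except a double quote, newlines included.
ATTRIBUTED = re.compile(r'\b(?:said|writes|argues|concludes),?\s+"([^"]*)"')

# A closing quotation mark followed directly by a footnote number.
FOOTNOTED = re.compile(r'"([^"]*)"(\d+)')

news = 'Alen said, "text to be\nextracted." The word "obscure" is not speech.'
paper = 'Smith calls this "a surprising result"12 in the survey.'

attributed_quotes = ATTRIBUTED.findall(news)   # list of quoted strings
footnoted_quotes = FOOTNOTED.findall(paper)    # list of (quote, number) tuples

print(attributed_quotes)
print(footnoted_quotes)
```

The quoted word "obscure" is skipped because no reporting verb precedes it. Quotes where the attribution comes after the closing mark would need a second pattern.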

As for NLTK, I can't think of anything in it that would help much, other than perhaps WordNet for finding synonyms of "said".

Joseph
0

This qualifies as a pattern, i.e., the data you are looking for is always between quotation marks. Simply put, you can use a regex for pattern matching. Take this example:

she said " DAS A SDASD sdasdasd SADSD", " SA23 DSD " ASDAS "ASDAS1 3123$ %$%"

A regex that works for this basic example, with the example stored in text, is:

import re
quotes = re.findall(r'".*?"', text)

quotes gives us ['" DAS A SDASD sdasdasd SADSD"', '" SA23 DSD "', '"ASDAS1 3123$ %$%"']

Here, .*? lazily matches as few characters as possible (any character except a newline), and the quotation marks at the beginning and end of the pattern are matched literally.
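A short comparison may make the role of the `?` clearer. The sample strings below are invented for illustration. The last two lines show the newline limitation and how the standard `re.DOTALL` flag lifts it.

```python
import re

text = 'she said "first" and then "second"'

# Greedy: .* runs from the first quotation mark to the last one.
greedy = re.findall(r'".*"', text)
# Lazy: .*? stops at the nearest closing quotation mark.
lazy = re.findall(r'".*?"', text)

multiline = 'he wrote "line one\nline two"'
# By default . does not match a newline, so this quote is missed.
without_dotall = re.findall(r'".*?"', multiline)
# With re.DOTALL, . also matches newlines.
with_dotall = re.findall(r'".*?"', multiline, flags=re.DOTALL)

print(greedy)
print(lazy)
print(without_dotall)
print(with_dotall)
```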

Beware that quotation marks nested within quotation marks break this pattern: you will not get the expected output.
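To see the failure concretely, the sketch below runs the same pattern on an invented sentence with a nested quote. The second half assumes the articles use typographic (curly) quotation marks, which may not hold for your data. When they do, the opening and closing marks are different characters, so the outer pair can be matched unambiguously.

```python
import re

nested = 'He said "she told me "no" yesterday"'
# Straight quotes are paired in order of appearance, so the outer
# quotation is split into two wrong pieces.
broken = re.findall(r'".*?"', nested)

curly = 'He said “she told me ‘no’ yesterday”'
# Assumption: the text uses “ to open and ” to close a quotation.
outer = re.findall(r'“([^”]*)”', curly)

print(broken)
print(outer)
```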

Ic3fr0g
  • This picks up anything between quotation marks. Depending on the text you are evaluating, you'll pick up a bunch of junk that isn't reported speech... just stuff wrapped in quotation marks like the letter "A", the word "obscure" means... etc. – Joseph Jun 21 '16 at 07:28
  • I've worked under the assumption that OP is working with **structured data that has meaningful conversation**, because OP says `I need to extract quotations from a huge set of articles`. So my assumption is a reasonable one. I'll +1 you for adding something good to the answers. – Ic3fr0g Jun 21 '16 at 08:22