I'd like to know if it's possible to search for all verbs in a Microsoft Word document.

I've found that you can find all the "forms" of a particular verb (for example, search for "be" and Word will find "be", "am", "are", "was", etc.), but I need something more general: find every verb (and maybe its forms).

I've also looked at the question "Checking whether a particular word is a noun or verb" and saw that it mentions using VBA. Is there some sort of API I can use with MS Word to find all verbs, or to access some kind of metadata/registry about words? Or is there some kind of special regex I can use for this?

I understand that it's sometimes impossible to determine whether a word is a noun or a verb, but it's not a problem if the result isn't 100% accurate.

For some context: I'm writing in French, and even though MS Word finds a lot of mistakes, it doesn't find them all. There are some recurrent kinds of mistakes that MS Word doesn't see, but that I could check quickly myself if I could search for every verb (faster than rereading the whole document).

I'm using Microsoft Office 2007 SP3.

Edit: of course I'm not sure whether this is possible, but MS Word seems to know this rather accurately. Based on how it's able to correct grammatical mistakes, I believe MS Word has some way to tell whether a word is a verb, a noun, a plural, etc. Maybe I'm wrong about how MS Word works, or maybe I'm right but there is no way to access this kind of data. And if I'm right and it is possible to access it, how?

Asoub
  • which programming language? – Fredrik Oct 20 '16 at 14:40
  • The post you link to correctly points out how near impossible this is. This would require a fairly complicated AI to even determine what a verb is. You'll probably have even more errors by having a computer try to determine what's a verb. – Carcigenicate Oct 20 '16 at 14:43
  • @FredrikRedin I was hoping with either a regex or VBA (if ms-Word showed some kind of API for this). – Asoub Oct 20 '16 at 14:50
  • @Carcigenicate maybe there's just a big database embedded in MS Word, and that would be enough? As for verbs, context might help, and MS Word seems to be rather good at this. – Asoub Oct 20 '16 at 14:51
  • @Asoub It would be extremely context dependent. I'm going to bet that unless you can find a library made for exactly this purpose, you're going to have a very difficult time with this. – Carcigenicate Oct 20 '16 at 14:59
  • A regex only describes a search pattern as a sequence of characters; you still need some sort of programming language to interpret the results. If you are new to programming I would recommend C#: it's modern, IMO easier than VBA and many others, and together with Microsoft's Open XML SDK it makes reading and parsing Word documents programmatically easy. To determine whether a word is a verb, I would use a good dictionary REST API (there are many dictionary APIs out there). Good luck. – Fredrik Oct 21 '16 at 08:23
  • OpenXML SDK: https://msdn.microsoft.com/en-us/library/office/bb448854.aspx Dictionary API: http://www.programmableweb.com/category/dictionary – Fredrik Oct 21 '16 at 08:23
  • @FredrikRedin thanks a lot! I'm a Java developer, so learning a little bit of C# won't be a problem. I remember something about Word documents being XML inside a ZIP, so I think I see what you're telling me. You can add this as an answer rather than a comment. I'll keep looking for other options, but I don't think I'll find anything better than that! – Asoub Oct 21 '16 at 08:37

1 Answer

A regex only describes a search pattern as a sequence of characters; it has no notion of grammar, so you still need some sort of programming language to interpret the results. If you are new to programming I would recommend C#: it's modern, IMO easier than VBA and many others, and together with Microsoft's Open XML SDK it makes reading and parsing Word documents programmatically easy. To determine whether a word is a verb, I would use a good dictionary REST API (there are many dictionary APIs out there).

Edit: If you are comfortable with Java, use Java. A .docx file is a ZIP archive of XML parts, so you can use Java to open the archive, drill down into the main XML part (normally word/document.xml) and find all text elements (as well as make calls to a dictionary REST API of your choice).
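As a rough illustration of the lookup step, here is a minimal Java sketch that splits text into words and keeps those a lexicon reports as verbs. The names `VerbFinder` and `findVerbs` and the three-word toy lexicon are hypothetical and exist only for this example; in practice the `isVerb` predicate would wrap whichever dictionary API or word list you choose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VerbFinder {
    // \p{L}+ matches a run of Unicode letters, so accented French letters stay
    // inside a word while apostrophes split elisions such as "c'est" into "c" + "est".
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    /** Returns, in document order, every word for which isVerb answers true. */
    static List<String> findVerbs(String text, Predicate<String> isVerb) {
        List<String> verbs = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String word = m.group();
            // The lexicon is assumed to hold lowercase forms.
            if (isVerb.test(word.toLowerCase(Locale.ROOT))) {
                verbs.add(word);
            }
        }
        return verbs;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a real dictionary: a few conjugated forms.
        Set<String> toyLexicon = Set.of("est", "mange", "sont");
        System.out.println(
            findVerbs("Le chat mange. Ils sont ici, c'est vrai.", toyLexicon::contains));
        // prints [mange, sont, est]
    }
}
```

A pure lookup like this ignores context, so homographs (for example "est" as "is" or as "east") will be flagged whenever any reading is a verb. That matches the question's tolerance for imperfect results.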

Simplified XML structure of the main part (word/document.xml) of a .docx document:

<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Example text.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>
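One way to pull the text out of that structure in Java, using only the standard library, is sketched below. The names `DocxText`, `extractParagraphs` and `readDocx` are hypothetical, and the sketch assumes the main text lives in the ZIP entry word/document.xml, which is where Word normally puts it. It joins the `w:t` elements of each `w:p` paragraph, since Word often splits a single word across several `w:r` runs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DocxText {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one string per w:p paragraph, built by joining its w:t elements. */
    static List<String> extractParagraphs(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look up elements by namespace
        // Refuse DOCTYPE declarations, since the file may come from an untrusted source.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(xml);

        List<String> result = new ArrayList<>();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            NodeList texts = p.getElementsByTagNameNS(W_NS, "t");
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                sb.append(texts.item(j).getTextContent());
            }
            result.add(sb.toString());
        }
        return result;
    }

    /** Opens the .docx as a ZIP archive and reads its main document part. */
    static List<String> readDocx(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return extractParagraphs(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxText <file.docx>");
            return;
        }
        for (String paragraph : readDocx(args[0])) {
            System.out.println(paragraph);
        }
    }
}
```

This only covers the main body. Headers, footers, footnotes and comments live in other parts of the archive, and a paragraph nested inside another (for example in a text box) would have its text reported twice by this simple traversal.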

Good luck

OpenXML SDK: https://msdn.microsoft.com/en-us/library/office/bb448854.aspx and https://msdn.microsoft.com/en-us/library/office/ff478541.aspx

Dictionary API: http://www.programmableweb.com/category/dictionary

How to read a .doc or .docx file in Java: https://stackoverflow.com/a/7102794/1380061

Fredrik