
We have a requirement to extract dark data from unstructured sources such as letters, rad reports, etc. Please suggest Azure resources for extracting data from common document formats (DOC, DOCX, PDF, RTF, TXT, HTML, etc.) and then analyzing the extracted data.

191180rk

1 Answer


It sounds like you just want to extract raw text or images from these rich-format documents. If that is all you need, what you really want is a set of libraries for parsing the different document types.

Here are some libraries for Java and Python that do that. If you are using .NET, which I'm not familiar with, you can search Google or Bing for .NET alternatives.

  1. To parse Office documents such as DOC and DOCX: for Java, Apache POI is a good library for extracting data from MS Office files; for Python, python-docx can read DOCX, but for the legacy DOC format I'm not aware of a pure-Python package, so the options are a COM object such as Word.Application, or IronPython on .NET on Windows (see "Reading/Writing MS Word files in Python").
  2. To parse PDF files: Apache PDFBox and jPDFText for Java, and PyPDF2 for Python.
  3. To read RTF files: Java supports this natively via javax.swing.text.rtf.RTFEditorKit, and sample code is easy to find by searching; as with #1, I'm not aware of an equivalent for Python.
  4. To parse HTML files: jsoup for Java, and BeautifulSoup or HTMLParser for Python, are good choices for extracting data from HTML.
  5. Reading TXT files is simple in any language. To extract valuable information from the text content, Stanford NLP for Java and NLTK for Python are useful; the Text Analytics API of Azure Cognitive Services can also help with tasks such as key phrase extraction and language detection.
  6. The Apache Tika toolkit for content analysis is a good solution, too. You can even deploy it standalone and invoke its REST API from Python or other languages.
  7. If you want to extract text from images, you can use the Computer Vision API of Azure Cognitive Services to extract printed or handwritten text, or use a third-party library such as Tess4J or others you find on GitHub.
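
As an illustration of calling Tika's REST API from Python (#6), here is a minimal sketch using only the standard library. It assumes a Tika server is already running locally on its default port 9998; the helper names build_tika_request and extract_text are my own, not part of Tika, and the URL should be adjusted to wherever you deploy it.

```python
import urllib.request

# Assumption: a Tika server is reachable here (9998 is Tika server's default port).
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika to return the document's plain text."""
    return urllib.request.Request(
        tika_url,
        data=data,                         # raw bytes of the DOC/DOCX/PDF/RTF/... file
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain text rather than XHTML
    )


def extract_text(path: str, tika_url: str = TIKA_URL) -> str:
    """Send one local file to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), tika_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server running, extract_text("report.pdf") would return the text of that file (the file name is only a placeholder). Tika detects the format from the content itself, so the same call can be used for each of the formats listed in the question.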

Almost all of the above rely on third-party development kits rather than Azure resources. However, you can store the documents in Azure Storage and process them on an Azure VM or with Azure Batch, and then analyze the extracted data in Azure Jupyter Notebooks or use Azure ML for deeper research.
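
The processing step on a VM or Batch node could look roughly like the sketch below. It assumes the documents have already been downloaded from storage into a local folder, and it only handles the formats the Python standard library can read on its own (TXT and HTML); everything else is left as None to be handed to a dedicated parser or Tika. All function and class names here are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, joined by single spaces."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


def extract_local(path: Path):
    """Return text for TXT and HTML files, or None for other formats."""
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8", errors="replace")
    if suffix in (".html", ".htm"):
        return html_to_text(path.read_text(encoding="utf-8", errors="replace"))
    return None  # DOC, DOCX, PDF, RTF: route to a dedicated parser or Tika


def process_folder(folder: str) -> dict:
    """Map each file name in the folder to its extracted text (or None)."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    return {p.name: extract_local(p) for p in files}
```

The resulting dictionary can then be written out or loaded into a notebook for the analysis stage.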

Peter Pan
  • Are there any Natural Language Processing (NLP) APIs/tools specific to the healthcare domain for extracting unstructured data from clinical notes, medical reports, and investigations? – 191180rk Mar 29 '19 at 14:21
  • @thiru I'm not familiar with the healthcare domain, and I'm not sure what NLP specific to it would involve. I just searched for Python and healthcare and found https://pythonhealthcare.org/ and https://healthcare.ai/. Maybe searching along those lines will help. – Peter Pan Mar 30 '19 at 10:59