We have a requirement to extract dark data from unstructured sources such as letters, rad reports, etc. Please suggest an Azure resource to extract data from common document formats (DOC, DOCX, PDF, RTF, TXT, HTML, etc.) and then analyze the extracted data.
1 Answer
It sounds like you just want to extract raw text or images from these rich-format documents. If that is all you need, what you really want is a set of libraries for parsing the different document types.
Here are some libraries for Java and Python. If you are using .NET, which I'm not familiar with, you can search Google or Bing for .NET alternatives.
- To parse Office documents such as DOC and DOCX: for Java, Apache POI is a good library for extracting data from MS Office files. For Python, python-docx can read DOCX files; for legacy DOC files I don't know of a pure-Python package, other than driving Word through a COM object such as Word.Application, or using IronPython (Reading/Writing MS Word files in Python) with .NET on Windows.
- To parse PDF files: there are Apache PDFBox and jPDFText for Java, and PyPDF2 for Python.
- To read RTF files: Java supports this natively via javax.swing.text.rtf.RTFEditorKit, for which sample code is easy to find by searching; as with the first item, there seems to be nothing comparable built into Python.
- To parse HTML files: jsoup for Java, and BeautifulSoup and HTMLParser for Python, are the best options for extracting data from HTML.
- Reading TXT files is simple in any language. To extract valuable information from the text content, Stanford NLP for Java and NLTK for Python are useful. The Text Analytics API of Azure Cognitive Services can also help with tasks such as key phrase extraction and language detection.
- The Apache Tika toolkit for content analysis is a good solution, too. You can even deploy it standalone and call its REST APIs from Python or other languages.
- If you want to extract text from images, you can use the Computer Vision API of Azure Cognitive Services to extract printed or handwritten text, or use a third-party library such as Tess4J or others you find on GitHub.
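As an illustration of the HTML item, here is a minimal sketch that pulls visible text out of an HTML string using only HTMLParser from the Python standard library. The class name TextExtractor and the function html_to_text are made up for this example, and skipping the contents of script and style tags is my own assumption about what counts as "text"; BeautifulSoup would be the more convenient choice for messy real-world pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping script/style bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style and not pure whitespace.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, fragments joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><h1>Report</h1><p>Findings: <b>normal</b></p></body></html>"
    print(html_to_text(sample))
```

The resulting plain text can then be passed to NLTK or the Text Analytics API in the same way as the contents of a TXT file.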
Almost all of the above depend on third-party libraries rather than Azure resources. However, you can store the documents in Azure Storage and process them on an Azure VM or the Azure Batch service, and then analyze the extracted data in an Azure Jupyter Notebook or use Azure ML for deeper research.
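For the Apache Tika option, calling a standalone Tika server from Python might look roughly like the sketch below. It assumes a Tika server is already running somewhere you can reach (for example on an Azure VM) and listening on its default address http://localhost:9998, that the /tika endpoint is used with PUT and an Accept: text/plain header, and that the response is UTF-8. The function names build_tika_request and extract_text are made up for this example.

```python
import urllib.request

# Assumption: default address of a locally started Tika server.
DEFAULT_TIKA_URL = "http://localhost:9998"


def build_tika_request(document_bytes, base_url=DEFAULT_TIKA_URL):
    """Build (but do not send) a PUT request asking Tika for plain text."""
    url = base_url.rstrip("/") + "/tika"
    return urllib.request.Request(
        url,
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(document_bytes, base_url=DEFAULT_TIKA_URL, timeout=60):
    """Send the document to the Tika server and return the extracted text."""
    request = build_tika_request(document_bytes, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the server answers in UTF-8.
        return response.read().decode("utf-8")


if __name__ == "__main__":
    import sys

    # Usage: python tika_client.py some_document.pdf
    with open(sys.argv[1], "rb") as handle:
        print(extract_text(handle.read()))
```

Because Tika detects the format itself, the same call can cover DOC, DOCX, PDF, RTF and HTML without a per-format library on the Python side.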

Peter Pan
- Are there any Natural Language Processing (NLP) APIs/tools specific to the healthcare domain for extracting unstructured data from clinical notes, medical reports and investigations? – 191180rk Mar 29 '19 at 14:21
- @thiru I'm not familiar with the healthcare domain, and I'm not sure what NLP tooling specific to it would look like. I just searched for Python and healthcare and found https://pythonhealthcare.org/ and https://healthcare.ai/. You may be able to find more by searching yourself. – Peter Pan Mar 30 '19 at 10:59