Questions tagged [text-extraction]

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

Text extraction mechanisms vary with the context and the language involved. Approaches range from regular expressions and classifiers to more complex custom models.

1282 questions
490
votes
14 answers

How to extract a substring using regex

I have a string that has two single quotes in it, the ' character. In between the single quotes is the data I want. How can I write a regex to extract "the data i want" from the following text? mydata = "some string with 'the data i want' inside";
asdasd
  • 5,099
  • 2
  • 16
  • 7
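
For the quoted-substring question above, one possible approach is a capture group that matches everything between the first pair of single quotes. This is only a sketch in Python (the excerpt does not say which language the asker uses), and the helper name `extract_quoted` is hypothetical:

```python
import re


def extract_quoted(text):
    """Return the text between the first pair of single quotes, or None."""
    # '([^']*)' captures any run of non-quote characters between two quotes.
    match = re.search(r"'([^']*)'", text)
    return match.group(1) if match else None


mydata = "some string with 'the data i want' inside"
print(extract_quoted(mydata))
```

Using `[^']*` rather than `.*` keeps the match from running past the closing quote when the string contains more than one quoted section.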
416
votes
13 answers

Python module for converting PDF to text

Is there any Python module to convert PDF files into text? I tried one piece of code found in Activestate which uses pypdf, but the text generated had no spaces between words and was of no use.
cnu
  • 36,135
  • 23
  • 65
  • 63
396
votes
23 answers

Extract a single (unsigned) integer from a string

I want to extract the digits from a string that contains numbers and letters like: "In My Cart : 11 items" I want to extract the number 11.
Bizboss
  • 7,792
  • 27
  • 109
  • 174
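
For the integer-extraction question above, a minimal sketch is to search for the first run of digits and convert it. It is written in Python as an assumption (the excerpt does not name a language), and `first_unsigned_int` is a hypothetical helper name:

```python
import re


def first_unsigned_int(text):
    """Return the first run of digits in text as an int, or None if absent."""
    # \d+ matches one or more consecutive digits; a leading minus sign is ignored.
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(first_unsigned_int("In My Cart : 11 items"))
```

If the string can contain several numbers, `re.findall(r"\d+", text)` returns all of them instead of only the first.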
187
votes
15 answers

How to extract text from a PDF?

Can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the…
Budda007
  • 1,903
  • 2
  • 12
  • 3
118
votes
8 answers

How to extract string following a pattern with grep, regex or perl

I have a file that looks something like this:
wrangler
  • 1,995
  • 2
  • 19
  • 22
114
votes
6 answers

Extracting text from a PDF file using PDFMiner in python?

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python. It looks like PDFMiner updated their API and all the relevant examples I have found contain outdated code (classes and methods have…
RattleyCooper
  • 4,997
  • 5
  • 27
  • 43
84
votes
2 answers

PDF Parsing Using Python - extracting formatted and plain texts

I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the…
Mike Cialowicz
  • 9,892
  • 9
  • 47
  • 76
72
votes
4 answers

How to extract common / significant phrases from a series of text entries

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching). My example is any review on Yelp.com,…
arronsky
  • 721
  • 1
  • 6
  • 3
64
votes
4 answers

C# Extract text from PDF using PdfSharp

Is there a possibility to extract plain text from a PDF-File with PdfSharp? I don't want to use iTextSharp because of its license.
der_chirurg
  • 1,475
  • 2
  • 16
  • 26
59
votes
6 answers

How to extract just plain text from .doc & .docx files?

Anyone know of anything they can recommend in order to extract just the plain text from a .doc or .docx? I've found this - wondered if there were any other suggestions?
docextract
  • 663
  • 1
  • 6
  • 3
49
votes
2 answers

How can I read pdf in python?

How can I read a PDF in Python? I know one way of converting it to text, but I want to read the content directly from the PDF. Can anyone explain which Python module is best for PDF extraction?
sg1994
  • 557
  • 1
  • 4
  • 6
48
votes
2 answers

Extract text from pdf file using javascript

I want to extract text from a PDF file using only JavaScript on the client side, without using the server. I've already found JavaScript code at the following link: extract text from pdf in Javascript and then in…
Coccinelle
  • 527
  • 1
  • 5
  • 6
45
votes
3 answers

PDF text extraction from given coordinates

I would like to extract text from a portion (using coordinates) of PDF using Ghostscript. Can anyone help me out?
AMER
  • 971
  • 2
  • 10
  • 9
45
votes
8 answers

Extract all email addresses from bulk text using jquery

I have the text below: sdabhikagathara@rediffmail.com, "assdsdf" , "rodnsdfald ferdfnson" , "Affdmdol Gondfgale" , "truform techno"…
Milind Anantwar
  • 81,290
  • 25
  • 94
  • 125
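
For the email-extraction question above, a rough sketch is to scan the text with a deliberately simple address pattern. The question asks about jQuery; this example uses Python instead, and the pattern, the sample text, and the name `extract_emails` are all assumptions for illustration. The pattern does not cover every address the email standards allow:

```python
import re

# Simplified pattern: local part, "@", domain labels, and a final label of 2+ letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


# Hypothetical sample input, not the data from the question.
sample = 'alice@example.com, "Bob" <bob@example.org>, "no address here"'
print(extract_emails(sample))
```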
42
votes
10 answers

How to extract text from MS office documents in C#

I was trying to extract text (a string) from MS Word (.doc, .docx), Excel and PowerPoint using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried to use NPOI but I couldn't find a sample showing how to use it.
Elias Haileselassie
  • 1,385
  • 1
  • 18
  • 26