Questions tagged [text-extraction]

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

Text extraction mechanisms vary with the context and the language involved. Approaches range from regular expressions to classifiers to more complex or custom models.
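As a minimal sketch of the simplest approach mentioned above (regular expressions), the following Python snippet pulls "key: value" fields out of a small semi-structured text. The sample input, the field names, and the helper `extract_fields` are all hypothetical and only serve as an illustration.

```python
import re

# Hypothetical semi-structured input; the field names are illustrative only.
text = """Invoice: 1042
Date: 2011-01-11
Total: 59.90 EUR"""

# One "key: value" pair per line; MULTILINE makes ^ and $ match at line boundaries.
FIELD = re.compile(r"^(?P<key>\w+):\s*(?P<value>.+)$", re.MULTILINE)


def extract_fields(document):
    """Return every 'key: value' line of the document as a dict entry."""
    return {m.group("key"): m.group("value").strip()
            for m in FIELD.finditer(document)}


record = extract_fields(text)
print(record)
```

A pattern like this only works while the input keeps a predictable layout; less regular documents are where classifiers or custom models tend to come in.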

More Info

Information extraction on Wikipedia

1282 questions

490

votes

14 answers

How to extract a substring using regex

I have a string that has two single quotes in it, the ' character. In between the single quotes is the data I want. How can I write a regex to extract "the data i want" from the following text? mydata = "some string with 'the data i want' inside";

asked Jan 11 '11 at 20:22

asdasd

5,099
2
16
7
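One possible way to approach the question above, sketched in Python (an assumption, since the excerpt does not name a language): capture everything between the first pair of single quotes. The helper name `between_single_quotes` is hypothetical.

```python
import re

mydata = "some string with 'the data i want' inside"


def between_single_quotes(s):
    """Return the text between the first pair of single quotes, or None."""
    # [^']* matches any run of characters that are not a single quote,
    # so the capture group stops at the closing quote.
    match = re.search(r"'([^']*)'", s)
    return match.group(1) if match else None


print(between_single_quotes(mydata))
```

Using a negated character class instead of `.*` keeps the match from running past the first closing quote when the string contains more than one quoted section.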

416

votes

13 answers

Python module for converting PDF to text

Is there any Python module to convert PDF files into text? I tried one piece of code found on ActiveState that uses pypdf, but the generated text had no spaces and was of no use.

python pdf text-extraction pdf-scraping

asked Aug 25 '08 at 04:44

cnu

36,135
23
65
63

396

votes

23 answers

Extract a single (unsigned) integer from a string

I want to extract the digits from a string that contains numbers and letters like: "In My Cart : 11 items" I want to extract the number 11.

php string integer text-extraction

asked Jun 08 '11 at 11:53

Bizboss

7,792
27
109
174
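The question above is tagged php; as a rough illustration of the same idea in Python (a swapped-in language, not the asker's), a regular expression can pick out the first run of digits and convert it to an integer. The helper name `first_unsigned_int` is hypothetical.

```python
import re


def first_unsigned_int(s):
    """Return the first run of digits in s as an int, or None if there is none."""
    match = re.search(r"\d+", s)
    return int(match.group()) if match else None


print(first_unsigned_int("In My Cart : 11 items"))
```

This sketch deliberately ignores signs and decimal points, matching the "single (unsigned) integer" wording of the title.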

187

votes

15 answers

How to extract text from a PDF?

Can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the…

pdf text ghostscript extract text-extraction

asked Sep 06 '10 at 11:11

Budda007

1,903
2
12
3

118

votes

8 answers

How to extract string following a pattern with grep, regex or perl

I have a file that looks something like this:

regex perl sed html-parsing text-extraction

asked Feb 22 '11 at 16:34

wrangler

1,995
2
19
22

114

votes

6 answers

Extracting text from a PDF file using PDFMiner in python?

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python. It looks like PDFMiner updated their API and all the relevant examples I have found contain outdated code (classes and methods have…

python python-3.x python-2.7 text-extraction pdfminer

asked Oct 21 '14 at 18:56

RattleyCooper

4,997
5
27
43

votes

2 answers

PDF Parsing Using Python - extracting formatted and plain texts

I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the…

python pdf parsing text-extraction information-extraction

asked Dec 04 '09 at 17:28

Mike Cialowicz

9,892
9
47
76

votes

4 answers

How to extract common / significant phrases from a series of text entries

I have a series of text items- raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching). My example is any review on Yelp.com,…

nlp text-extraction nltk text-analysis

asked Mar 16 '10 at 08:42

arronsky

votes

4 answers

C# Extract text from PDF using PdfSharp

Is there a possibility to extract plain text from a PDF-File with PdfSharp? I don't want to use iTextSharp because of its license.

c# text text-extraction pdfsharp

asked Apr 13 '12 at 12:48

der_chirurg

1,475
2
16
26

votes

6 answers

How to extract just plain text from .doc & .docx files?

Anyone know of anything they can recommend in order to extract just the plain text from a .doc or .docx? I've found this - wondered if there were any other suggestions?

unix extract docx doc text-extraction

asked Apr 15 '11 at 03:12

docextract

votes

2 answers

How can I read pdf in python?

How can I read a PDF in Python? I know one way of converting it to text, but I want to read the content directly from the PDF. Can anyone explain which Python module is best for PDF extraction?

python python-2.7 pdf text-extraction

asked Aug 21 '17 at 10:43

sg1994

votes

2 answers

Extract text from pdf file using javascript

I want to extract text from pdf file using only Javascript in the client side without using the server. I've already found a javascript code in the following link: extract text from pdf in Javascript and then in…

javascript pdf text-extraction pdf.js

asked Jul 02 '13 at 11:39

Coccinelle

votes

3 answers

PDF text extraction from given coordinates

I would like to extract text from a portion (using coordinates) of PDF using Ghostscript. Can anyone help me out?

pdf ghostscript text-extraction

asked May 31 '11 at 11:59

AMER

votes

8 answers

Extract all email addresses from bulk text using jquery

I have this text below: sdabhikagathara@rediffmail.com, "assdsdf" , "rodnsdfald ferdfnson" , "Affdmdol Gondfgale" , "truform techno"…

javascript jquery regex text-extraction email-address

asked Jan 21 '13 at 14:11

Milind Anantwar

81,290
25
94
125

votes

10 answers

How to extract text from MS office documents in C#

I was trying to extract text (strings) from MS Word (.doc, .docx), Excel, and PowerPoint using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried to use NPOI, but I couldn't find a sample showing how to use it.

c# ms-office text-extraction

asked Jun 18 '09 at 07:20

Elias Haileselassie

1,385
1
18
26
