How do I extract significant text content from a LaTeX document

Question

I need to extract text-only content from my thesis, which is written in LaTeX, for an automated anti-plagiarism check. I only know about the "draft" option, and it is not enough.

I am supposed to omit:

images,
tables and other figures,
equations,
captions and footnotes.

It'd also be nice to remove all the references. The output should be a plain (UTF-8 encoded) text file.

Is there any straightforward way to do this? I don't really fancy copying it manually page-by-page.

Let me guess - your institution's anti-plagiarism software only works on MSWord documents and plain text files? — Spacedman, Jan 29 '11 at 14:30
Good guess Spacedman, but only plain text. I guess that's reasonable since it's not so easy to automate on their end. — odiroot, Jan 29 '11 at 15:54
You might get more answers at the [TeX SE site](http://tex.stackexchange.com). — frabjous, Feb 01 '11 at 20:40

score 1 · Answer 1 · edited Jun 27 '21 at 05:23

Yes: try untex, a small C program. You can also look at detex.

Community

answered Jan 29 '11 at 14:04

Francois G

I tried detex, it does help but still produces a lot of cruft. Thanks anyway. – odiroot Jan 29 '11 at 14:54

score 1 · Accepted Answer · answered Jan 29 '11 at 14:07

You could try using the comment package (or one of a dozen alternatives) to turn equation, figure, table, etc. into comment environments, and \renewcommand\footnote[1]{} to remove footnotes. \pagestyle{empty} should remove page headers and the like, so running pdftotext on the result should come close to what you want.
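As a rough sketch of what that preamble could look like (not a tested recipe for any particular thesis): the list of environments is an assumption and needs to be extended to whatever the document actually uses, and the lines should come after any packages that define those environments. Note that the comment package documentation asks for the `\begin{...}` and `\end{...}` lines to stand on their own, so environments written differently may need adjusting.

```latex
% Sketch of a "text-only" preamble block.
% Place it after all packages that define the environments suppressed below.
\usepackage{comment}

% Turn whole environments into comments.
% The list here is an assumption; add align, tabular, etc. as needed.
\excludecomment{equation}
\excludecomment{figure}
\excludecomment{table}

% Discard footnote text instead of typesetting it.
\renewcommand\footnote[1]{}

% No running heads or page numbers in the output.
\pagestyle{empty}
```

Compiling with this block in place and then running pdftotext on the resulting PDF should leave mostly body text. Captions inside the suppressed figure and table environments disappear along with them.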

Ulrich Schwarz

Sounds good. I'm going to try this. I don't really understand the bit about comment package. Does it comment out all the environments automatically or do I have to specify some env list to some command? – odiroot Jan 29 '11 at 14:58
@odiroot: one example, adapted from the `verbatim` package, would look like this: `\usepackage{verbatim}\let\equation=\comment\let\endequation=\endcomment` should suppress all your equation environments. So yes, you'll have a bit of typing doing this for equation, align, ... – Ulrich Schwarz Feb 01 '11 at 20:59

score 1 · Answer 3 · answered Feb 01 '11 at 20:42

You could use a document converter like pandoc, or convert the output PDF to plain text with something like Calibre.

frabjous

score 1 · Answer 4 · answered Feb 01 '11 at 22:34

Usually you want some LaTeX processing done on the text. Say you have:

\newcommand*{\SO}{StackOverflow\index{StackOverflow}\xspace}

...

I spend a lot of time on \SO, blah blah ....

Simply filtering out the text of this paragraph will not give the intended result, because it contains a macro that has to be expanded first.
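To make the point concrete, here is a minimal self-contained version of the snippet above. It assumes the xspace package is available (needed for `\xspace`); everything else is standard LaTeX.

```latex
\documentclass{article}
\usepackage{xspace} % provides \xspace

% The macro from the example: prints the name and adds an index entry.
\newcommand*{\SO}{StackOverflow\index{StackOverflow}\xspace}

\begin{document}
% In the source this line contains only "\SO,";
% in the typeset output it reads "StackOverflow,".
I spend a lot of time on \SO, blah blah.
\end{document}
```

A tool that only strips markup from the .tex file sees `\SO` and has no way of knowing it stands for "StackOverflow", while the compiled output contains the expanded word.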

Extracting text directly from the *.tex file therefore usually leaves much to be desired, and it is typically better to work on the output of LaTeX processing. I would recommend converting LaTeX to HTML and then HTML to text. You will probably need some manual clean-up, but the result should be relatively close.

Thanks for the html->txt tip. I'm currently replacing footnotes and caption with empty string using `renewcommand`. But I still haven't figured out how to do the same with `tabular(x)` since it has begin- and endtags. — odiroot, Feb 02 '11 at 17:32

score 1 · Answer 5 · answered Feb 04 '11 at 03:03

detex has already been mentioned, but there is another project that aims to improve on it. It is called opendetex; give it a look!

Joel Berger
