
I need to extract text-only content from my thesis document written in LaTeX for an automated anti-plagiarism check. I know only about the "draft" option and it's not enough.

I am supposed to omit:

  • images,
  • tables and other figures,
  • equations,
  • captions and footnotes.

It'd also be nice to remove all the references. The output should be a plain (UTF-8 encoded) text file.

Is there any straightforward way to do this? I don't really fancy copying it manually page-by-page.

Artjom B.
odiroot
  • Let me guess - your institution's anti-plagiarism software only works on MSWord documents and plain text files? – Spacedman Jan 29 '11 at 14:30
  • Good guess Spacedman, but only plain text. I guess that's reasonable since it's not so easy to automate on their end. – odiroot Jan 29 '11 at 15:54
  • You might get more answers at the [TeX SE site](http://tex.stackexchange.com). – frabjous Feb 01 '11 at 20:40

5 Answers


Yes: untex, a small program written in C. You can also look at detex.

Community
Francois G

You could try the `comment` package (or one of a dozen alternatives) to turn `equation`, `figure`, `table`, etc. into comment environments, and `\renewcommand\footnote[1]{}` to remove footnotes. `\pagestyle{empty}` should remove page headers etc., so running `pdftotext` on the resulting PDF should come close to what you want.

Ulrich Schwarz
  • Sounds good. I'm going to try this. I don't really understand the bit about comment package. Does it comment out all the environments automatically or do I have to specify some env list to some command? – odiroot Jan 29 '11 at 14:58
  • @odiroot: one example, adapted from the `verbatim` package, would look like this: `\usepackage{verbatim}\let\equation=\comment\let\endequation=\endcomment` should suppress all your equation environments. So yes, you'll have a bit of typing doing this for equation, align, ... – Ulrich Schwarz Feb 01 '11 at 20:59
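
Putting the answer and the comment together, a text-only build might use preamble lines like the following. This is a minimal sketch, not a tested recipe: it assumes the thesis uses the standard `equation`, `figure`, and `table` environments plus amsmath's `align`, that these lines come after all other packages are loaded, and that none of the suppressed environments sits inside another command's argument (verbatim-style scanning does not work there). The two-argument `\footnote` redefinition is a variation on the one in the answer that also swallows an optional footnote number.

```latex
% Text-only build: add to the preamble, after all other \usepackage lines.
\usepackage{verbatim}   % provides \comment / \endcomment

% Make each environment behave like `comment`, so its body is discarded.
\let\equation=\comment  \let\endequation=\endcomment
\let\align=\comment     \let\endalign=\endcomment   % assumes amsmath is loaded
\let\figure=\comment    \let\endfigure=\endcomment
\let\table=\comment     \let\endtable=\endcomment

% Discard footnotes, including the optional [number] form.
\renewcommand{\footnote}[2][]{}

% No running heads or page numbers in the output.
\pagestyle{empty}
```

Starred variants such as `equation*` and `align*` would each need their own pair of `\let` lines. Chapter opening pages may still carry a page number, since many classes issue `\thispagestyle{plain}` there.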

You could use a document converter like pandoc, or convert the output PDF to plain text with something like Calibre.

frabjous

Usually you want some LaTeX processing done on the text. Say you have

    \newcommand*{\SO}{StackOverflow\index{StackOverflow}\xspace}

    ...

    I spend a lot of time on \SO, blah blah ....

Extracting this paragraph straight from the source will not give the intended text whenever it contains macros such as `\SO`, because they are never expanded.

Extracting text directly from the *.tex file therefore usually leaves much to be desired. It is typically better to work on the output of LaTeX processing. I would recommend converting LaTeX to HTML and then HTML to text. You will probably need some manual clean-up, but the result should be relatively close.

hlovdal
  • Thanks for the html->txt tip. I'm currently replacing footnotes and captions with an empty string using `renewcommand`. But I still haven't figured out how to do the same with `tabular(x)`, since it has begin and end tags. – odiroot Feb 02 '11 at 17:32
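
For the `tabular(x)` case raised in the comment above, the `\let` trick from the `verbatim` package shown for `equation` in an earlier comment should apply to any environment, since it acts on the begin/end pair as a whole. A hedged sketch, assuming `tabular` and `tabularx` are the environments in use and that they never appear inside another command's argument; the `\caption` redefinition is one possible form, not necessarily the one the commenter used:

```latex
% Preamble, after tabularx (if used) has been loaded.
\usepackage{verbatim}   % provides \comment / \endcomment

% \begin{tabular}{...} ... \end{tabular} is now skipped entirely;
% the column specification is swallowed along with the body.
\let\tabular=\comment   \let\endtabular=\endcomment
\let\tabularx=\comment  \let\endtabularx=\endcomment

% Discard captions, including the \caption[short]{long} form.
\renewcommand{\caption}[2][]{}
```

The starred environment `tabular*` and the starred `\caption*` command (from the caption package) are not covered by this sketch and would need separate handling.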

detex has already been mentioned, but there is another project, opendetex, that aims to improve on it. Give it a look!

Joel Berger