Extract words from PDF with golang?

Question

I don't understand type conversion. I know this isn't right; all I get is a bunch of hieroglyphs.

f, _ := os.Open("test.pdf")
defer f.Close()
io.Copy(os.Stdout, f)

I want to work with the strings....

score 10 · Answer 1 · edited Sep 16 '19 at 17:04


I tried several Go PDF libraries and found that sajari/docconv works the way I expected.

It is easy to use. Here is an example:

package main

import (
    "fmt"
    "log"

    "code.sajari.com/docconv"
)

func main() {
    res, err := docconv.ConvertPath("your-file.pdf")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(res.Body) // Body holds the extracted plain text
}

answered Sep 18 '17 at 07:39 by Daoctor · edited Sep 16 '19 at 17:04 by pzkpfw

5

Note that the docconv package has dependencies that are only available for Linux – Max Stevens Jul 29 '20 at 19:07
1

If you are on macOS, try installing the dependencies with `brew install poppler` and `brew install tesseract` – Henry S. Oct 03 '22 at 21:33
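
The comments above mention Poppler as a dependency. As far as I can tell, docconv relies on Poppler's `pdftotext` command for PDFs, so if that tool is installed anyway, calling it directly is another option. The following is only a minimal sketch under that assumption: the helper name `pdfToText` is made up for illustration, and it uses the standard `pdftotext` flags `-layout` (keep the physical layout) and a trailing `-` (write to stdout).

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// pdfToText (hypothetical helper) shells out to Poppler's pdftotext and
// returns the extracted text. The trailing "-" makes pdftotext write to
// stdout instead of creating a .txt file next to the input.
func pdfToText(path string) (string, error) {
	bin, err := exec.LookPath("pdftotext")
	if err != nil {
		return "", fmt.Errorf("pdftotext not found in PATH (is poppler installed?): %w", err)
	}
	var stdout, stderr bytes.Buffer
	cmd := exec.Command(bin, "-layout", path, "-")
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("pdftotext failed: %v: %s", err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: pdftext <file.pdf>")
		return
	}
	text, err := pdfToText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Print(text)
}
```

This keeps the Go side free of third-party packages, at the cost of requiring the external binary on every machine that runs the program.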

Le Dong Thuc · Answer 2 · 2017-03-14T03:48:24.627

9

That happens because a PDF does not contain only text; it also contains formatting information (fonts, padding, margins, positions, shapes, images).
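
To make that concrete, here is a toy illustration of how text sits among positioning and font operators inside a page's content stream. The sample stream is hand-written and uncompressed, and `extractShownText` is a made-up helper that only understands simple `(…) Tj` strings. Real PDFs usually compress their streams and use font-specific encodings, so treat this as a sketch of the idea rather than a usable parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// sampleStream is a hand-written, uncompressed content stream.
// BT/ET begin and end a text object, Tf selects a font, Td moves the
// text position, and Tj shows a string.
const sampleStream = `BT
/F1 12 Tf
72 720 Td
(Hello) Tj
0 -14 Td
(World) Tj
ET`

// shownText matches a simple literal string (no nested parentheses or
// escapes) that is followed by the Tj operator.
var shownText = regexp.MustCompile(`\(([^()\\]*)\)\s*Tj`)

// extractShownText (hypothetical helper) returns the strings shown by Tj.
func extractShownText(stream string) []string {
	var out []string
	for _, m := range shownText.FindAllStringSubmatch(stream, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	for _, s := range extractShownText(sampleStream) {
		fmt.Println(s)
	}
}
```

Everything except the two parenthesised strings is layout information, which is why dumping the file to stdout does not give readable words.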

If you need to read the plain text without formatting, I have forked a repository and implemented a function that does that. You can find it at https://github.com/ledongthuc/pdf

I have also put together an example; I hope it is useful to you.

package main

import (
    "bytes"
    "fmt"

    "github.com/ledongthuc/pdf"
)

func main() {
    content, err := readPdf("test.pdf") // Read local pdf file
    if err != nil {
        panic(err)
    }
    fmt.Println(content)
}

func readPdf(path string) (string, error) {
    r, err := pdf.Open(path)
    if err != nil {
        return "", err
    }
    totalPage := r.NumPage()

    var textBuilder bytes.Buffer
    for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
        p := r.Page(pageIndex)
        if p.V.IsNull() {
            continue
        }
        textBuilder.WriteString(p.GetPlainText("\n"))
    }
    return textBuilder.String(), nil
}

answered Mar 14 '17 at 03:20 · edited Mar 14 '17 at 03:48 by Le Dong Thuc

3

I have found a bug in your library, but it's not possible to open an issue on the `ledongthuc/pdf` repository. – LeMoussel May 16 '17 at 16:27
@LeMoussel, I'm not sure why you can't create an issue in my project. Anyway, you can post the bug here and I will try to help you. – Le Dong Thuc Jun 13 '17 at 11:02
@LeDongThuc: See [How to extract plain text from PDF in golang](https://stackoverflow.com/questions/44560265/how-to-extract-plain-text-from-pdf-in-golang) – LeMoussel Jun 15 '17 at 06:37
@LeMoussel actually, you can: https://softwareengineering.stackexchange.com/questions/179468/forking-a-repo-on-github-but-allowing-new-issues-on-the-fork – KrzysztofSzarek Feb 25 '18 at 17:15
2

@LeDongThuc Using your library, I'm getting the below error: malformed PDF: reading at offset 0: stream not present – Shaik Sadiq Ahmed Jul 07 '21 at 06:50
@ShaikSadiqAhmed were you able to solve your issue? – lumo Dec 07 '22 at 10:50
@LeDongThuc I always get `panic: malformed PDF: reading at offset 0: stream not present` when I run `r, err := pdf.Open(path)` followed by `r.Page(1).Content()`. For example, with this PDF: https://www.cs.utexas.edu/~roshan/CHET.pdf. `r.NumPage()` and `r.Outline()` work, though. – flexwang Mar 11 '23 at 23:48
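
The comments above report a panic rather than a returned error. It does not fix the underlying parsing problem, but a generic Go guard can at least turn such a panic into an ordinary error so one bad file does not crash a batch job. This is a sketch using only the standard library; `safeExtract` is a made-up name, and the closure in `main` is a stand-in for whatever PDF call is being wrapped.

```go
package main

import (
	"errors"
	"fmt"
)

// safeExtract (hypothetical helper) runs extract and converts a panic
// raised inside it into a regular error value.
func safeExtract(extract func() (string, error)) (text string, err error) {
	defer func() {
		if r := recover(); r != nil {
			text = ""
			err = fmt.Errorf("pdf extraction panicked: %v", r)
		}
	}()
	return extract()
}

func main() {
	// Stand-in for a parser call that panics on a malformed file.
	_, err := safeExtract(func() (string, error) {
		panic(errors.New("malformed PDF: reading at offset 0: stream not present"))
	})
	fmt.Println("recovered:", err)
}
```

In practice the closure would wrap something like the `readPdf` function from the answer, for example `safeExtract(func() (string, error) { return readPdf(path) })`.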

score 5 · Answer 3 · answered Oct 02 '16 at 06:50

"all I get is a bunch of hieroglyphs."

What you get is the raw content of the PDF file, which is not plain text.

If you want to read a PDF file in Go, use one of the Go PDF libraries such as rsc.io/pdf or yob/pdfreader.

As mentioned here:

"I doubt there is any 'solid framework' for this kind of stuff. PDF format isn't meant to be machine-friendly by design, and AFAIK there is no guaranteed way to parse arbitrary PDFs."
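
As a small illustration of the point that the file is a structured container rather than text: a PDF normally begins with a `%PDF-` header followed by a version number, and the rest is mostly objects and (often compressed) streams. The sketch below only checks for that header. `looksLikePDF` is a made-up helper, and it assumes the header sits at byte offset 0, which some real-world files do not strictly honour.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// looksLikePDF (hypothetical helper) reports whether head starts with
// the "%PDF-" header that PDF files normally begin with.
func looksLikePDF(head []byte) bool {
	return bytes.HasPrefix(head, []byte("%PDF-"))
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: ispdf <file>")
		return
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	// Read only the first few bytes; a short file is not an error here.
	head := make([]byte, 8)
	n, err := io.ReadFull(f, head)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(looksLikePDF(head[:n]))
}
```

A check like this confirms the input really is a PDF before handing it to one of the libraries mentioned in the answers; extracting the words still needs a proper parser.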

Rudolfo Borges · Answer 4 · 2023-06-05T02:30:04.743

0

You can try the pdf2go library (github.com/rudolfoborges/pdf2go):

package main

import (
    "fmt"
    "github.com/rudolfoborges/pdf2go"
)

func main() {
    pdf, err := pdf2go.New("path/to/file.pdf", pdf2go.Config{
        LogLevel: pdf2go.LogLevelError,
    })

    if err != nil {
        panic(err)
    }

    text, err := pdf.Text()
    if err != nil {
        panic(err)
    }

    fmt.Println(text)

    pages, err := pdf.Pages()

    if err != nil {
        panic(err)
    }

    for _, page := range pages {
        fmt.Println(page.Text())
    }
}

answered May 28 '23 at 02:57 · edited Jun 05 '23 at 02:30 by Rudolfo Borges

Your answer could be improved by providing an example of the solution and how it helps the OP. – Tyler2P May 28 '23 at 09:55
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/late-answers/34458815) – AthulMuralidhar May 30 '23 at 12:56
