
I want to extract all rows from this table while ignoring the column headers as well as all page headers, i.e. the repeated "Supported Devices" lines.

pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
 | sed '$d'                                                  \
 | sed -r 's/ +/,/g; s/ //g'                                 \
 > output.csv

The resulting file should be in CSV spreadsheet format (comma separated value fields).

In other words, I want to improve the above command so that the output doesn't break at all. Any ideas?

user706838

7 Answers


I'll offer you another solution as well.

While in this case the pdftotext method works with reasonable effort, there may be cases where not every page has the same column widths (unlike your rather benign PDF).

Here the not-so-well-known but pretty cool free and open-source software Tabula-Extractor is the best choice.

I myself am using the direct GitHub checkout:

$ cd $HOME ; mkdir svn-stuff ; cd svn-stuff
$ git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor

I wrote myself a pretty simple wrapper script like this:

$ cat ~/bin/tabulaextr

 #!/bin/bash
 # Run tabula from its checkout directory, forwarding all arguments.
 cd ${HOME}/svn-stuff/git.tabula-extractor/bin
 ./tabula "$@"

Since ~/bin/ is in my $PATH, I just run

$ tabulaextr --pages all                                 \
         $(pwd)/DAC06E7D1302B790429AF6E84696FCFAB20B.pdf \
        | tee my.csv

to extract all the tables from all pages and convert them to a single CSV file.

The first ten (out of a total of 8727) lines of the CSV look like this:

$ head DAC06E7D1302B790429AF6E84696FCFAB20B.csv 

 Retail Branding,Marketing Name,Device,Model
 "","",AD681H,Smartfren Andromax AD681H
 "","",FJL21,FJL21
 "","",Luno,Luno
 "","",T31,Panasonic T31
 "","",hws7721g,MediaPad 7 Youth 2
 3Q,OC1020A,OC1020A,OC1020A
 7Eleven,IN265,IN265,IN265
 A.O.I. ELECTRONICS FACTORY,A.O.I.,TR10CS1_11,TR10CS1
 AG Mobile,Status,Status,Status

which in the original PDF look like this:

Screenshot from top of first page of sample PDF

It even got these lines on the last page, 293, right:

 nabi,"nabi Big Tab HD\xe2\x84\xa2 20""",DMTAB-NV20A,DMTAB-NV20A
 nabi,"nabi Big Tab HD\xe2\x84\xa2 24""",DMTAB-NV24A,DMTAB-NV24A

which look on the PDF page like this:

last page of sample PDF

TabulaPDF and Tabula-Extractor are really, really cool for jobs like this!


Update

Here is an ASCiinema screencast (which you also can download and re-play locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor:


Kurt Pfeifle

As Martin R commented, tabula-java is the new version of tabula-extractor and is actively maintained. Version 1.0.0 was released on July 21st, 2017.

Download the jar file and run it with a recent Java:

java -jar ./tabula-1.0.0-jar-with-dependencies.jar \
    --pages=all \
    ./DAC06E7D1302B790429AF6E84696FCFAB20B.pdf \
    > support_devices.csv
Nobu

What you want is rather easy, but you also have a second problem (I'm not sure you are aware of it...).

First, you should add -nopgbrk ("No page breaks, please!") to your command, so that those pesky ^L page-break characters, which would otherwise appear in the output, need not be filtered out later.

Adding a grep -vE '(Supported Devices|^$|Marketing Name)' will then filter out all the lines you do not want: the page headers, the column headers, and empty lines:

pdftotext -layout -nopgbrk                           \
   DAC06E7D1302B790429AF6E84696FCFAB20B.pdf -        \
 | grep -vE '(Supported Devices|^$|Marketing Name)'  \
 | gsed '$d'                                         \
 | gsed -r 's# +#,#g'                                \
 | gsed 's# ##g'                                     \
 > output2.csv

However, your other problem is this:

  1. Some of the table fields are empty.
  2. Empty fields appear with the -layout option as a series of space characters, sometimes even two in the same row.
  3. However, the text columns are not spaced identically from page to page.
  4. Therefore you will not know from line to line how many spaces you need to regard as an "empty CSV field" (where you'd need an extra , separator).
  5. As a consequence, your current code will show only one, two or three (instead of four) fields for some lines, and these fields end up in the wrong columns!

There is a workaround for this:

  1. Add the -x ... -y ... -W ... -H ... parameters to pdftotext to crop the PDF column-wise.
  2. Then append the columns with a combination of utilities like paste and column.

The following command extracts the first columns:

pdftotext -layout -x  38 -y 77 -W 176 -H 500  \
          DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 1st-columns.txt

These are for second, third and fourth columns:

pdftotext -layout -x 214 -y 77 -W 176 -H 500  \
          DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 2nd-columns.txt

pdftotext -layout -x 390 -y 77 -W 176 -H 500  \
          DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 3rd-columns.txt

pdftotext -layout -x 567 -y 77 -W 176 -H 500  \
          DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 4th-columns.txt

BTW, I cheated a bit: in order to get a clue about what values to use for -x, -y, -W and -H, I first ran this command to find the exact coordinates of the column header words:

pdftotext -f 1 -l 1 -layout -bbox \
          DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - | head -n 10
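
For reference, the -bbox output is XHTML in which every word is wrapped in a `<word>` element carrying `xMin`, `yMin`, `xMax` and `yMax` attributes (in PDF points). So one quick way to read off the header positions is to grep for the header words (a sketch; the coordinate values for your file will of course differ):

```shell
# Print only the <word> elements for the column-header words;
# their xMin values suggest roughly where each column starts.
pdftotext -f 1 -l 1 -bbox DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
  | grep -E 'Retail|Branding|Marketing|Device|Model'
```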

It's always good if you know how to read and make use of pdftotext -h. :-)

Anyway, how to append the four text files as columns side by side, with the proper CSV separator in between, you should find out yourself. Or ask a new question :-)
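
That said, here is one minimal sketch for that last step, assuming the four column files are line-aligned (exactly one line per table row, with blank lines kept where a field is empty):

```shell
# Glue the four per-column files together with "," as separator,
# then strip the padding spaces around each field.
paste -d, 1st-columns.txt 2nd-columns.txt 3rd-columns.txt 4th-columns.txt \
  | sed -e 's/[[:space:]]*,[[:space:]]*/,/g' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
  > merged.csv
```

Note that this does not CSV-quote fields that themselves contain commas or quotes; for those you would need a real CSV writer.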

Kurt Pfeifle
  • Kurt Pfeifle: How did you measure the x and y coordinates based on the -bbox command? I'm measuring inside a PDF viewer and get 50 and 100 for X and Y, respectively – riccs_0x Jul 29 '18 at 20:21
  • @riccs_0x: The `pdftotext` command needs PostScript points as units of distance. Does your PDF viewer (which one?!) show PostScript points? I can't remember how exactly I determined the params for above commands. It was more than 3 years ago. If I had to do it again today, I'd use Ghostscript with `'gs -sDEVICE=bbox'` to determine the bounding box of the complete page, then guess-timate the respective params for each column, then fine-tune/modify them depending on first results.... – Kurt Pfeifle Jul 30 '18 at 09:30
  • I'm using Evince and Atril – riccs_0x Jul 30 '18 at 14:01
  • Sorry to bother, I know this question is from a while ago. I have faced this issue several times and have managed to sort it somehow, but I'm looking for a more stable approach. Thanks for the great ideas you shared here. – riccs_0x Jul 30 '18 at 14:03

This can be done easily with an IntelliGet (http://akribiatech.com/intelliget) script like the one below:

userVariables = brand, name, device, model;
{ start = Not(Or(Or(IsSubstring("Supported Devices",Line(0)),
                  IsSubstring("Retail Branding",Line(0))),
                IsEqual(Length(Trim(Line(0))),0))); 
  brand = Trim(Substring(Line(0),10,44));
  name = Trim(Substring(Line(0),45,79));
  device = Trim(Substring(Line(0),80,114));
  model = Trim(Substring(Line(0),115,200));
  output = Concat(brand, ",", name, ",", device, ",", model);
}
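
For readers without IntelliGet: roughly the same fixed-column slicing can be sketched with plain awk over the `pdftotext -layout` output. The character offsets 10/45/80/115 are copied from the script above and will likely need tuning for your file:

```shell
# A rough awk equivalent of the IntelliGet script above (a sketch, not a
# drop-in replacement: the column offsets are approximate).
pdftotext -layout -nopgbrk DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
  | awk '
      function trim(s) { sub(/^[ \t]+/, "", s); sub(/[ \t]+$/, "", s); return s }
      # Skip page headers, the column-header line, and blank lines
      /Supported Devices/ || /Retail Branding/ || /^[ \t]*$/ { next }
      {
        brand  = trim(substr($0,  10, 35))
        name   = trim(substr($0,  45, 35))
        device = trim(substr($0,  80, 35))
        model  = trim(substr($0, 115))
        print brand "," name "," device "," model
      }' > output.csv
```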
NightOwl888

For the case where you want to extract tabular data from a PDF over which you have control at creation time (e.g. timesheets or contracts your employees have to sign), the following solution will be cleaner:

  1. Create a PDF form with field IDs.

  2. Let people fill and save the PDF forms.

  3. Use Apache PDFBox, an open-source tool that allows extracting form data from a PDF. It includes a command-line example tool, PrintFields, that you would call as follows to print the desired field information:

    org.apache.pdfbox.examples.interactive.form.PrintFields file.pdf
    

    For other options, see this question.

As an alternative to the above workflow, maybe you could also use a digital-signature web service that allows PDF form filling and exporting the data to tables, such as SignRequest, which lets you create templates and later export the data of signed documents. (Not affiliated, just found this myself.)

tanius

The first answer, by Kurt, is a good overall method for pulling columns of text, but in cases where we want to import into Excel as a spreadsheet, we can simply import the text as plain text.

Here is the Windows command:

pdftotext -nopgbrk -layout -fixed 4 -y 40 -H 600 -W 1000 "C:\data\DAC06E7D1302B790429AF6E84696FCFAB20B.pdf" -|find " ">>out.txt

The first few lines of out.txt:

                                                                                                  AD681H                                      Smartfren Andromax AD681H
                                                                                                  FJL21                                       FJL21
                                                                                                  Luno                                        Luno
                                                                                                  T31                                         Panasonic T31
                                                                                                  hws7721g                                    MediaPad 7 Youth 2
          3Q                                          OC1020A                                     OC1020A                                     OC1020A
          7Eleven                                     IN265                                       IN265                                       IN265
          A.O.I. ELECTRONICS FACTORY                  A.O.I.                                      TR10CS1_11                                  TR10CS1

The import will then be columnar, including odd quote characters, and can be saved as CSV if still desired.


There are many good "free" text-to-CSV importers that can analyse and morph text into other formats such as charts or columns; some can be command-line driven. Here is possibly the most feature-rich one for free[mium] Windows users, using the above sample file.


K J

The question is about the command line but, as another answer here mentions the use of Excel, it's worth mentioning that you can now import PDF files from Excel itself. This saved me a lot of time.

Adapted from instructions here: https://www.makeuseof.com/easily-extract-table-from-pdf/

  1. Open a new Excel spreadsheet.

  2. Go to the Data tab.

  3. In the Get & Transform section, click on Get Data.

  4. From the list, select From File and then select From PDF.

Once you click Open, a navigator window will open in Excel. In this window, you'll see the different tables that the PDF file contains.

  5. Select the table that you want to import.

  6. Click on Load.

I was surprised by how well this works in Windows, after searching for a more technical solution in Ubuntu.

DAB
  • @K J Before finding the Excel solution, I checked LibreOffice on Ubuntu but it doesn't seem to have this functionality. – DAB Aug 12 '23 at 12:17