4

I want to get a list of all the colleges in USA from this PDF file and put it into a CSV file. I will then import the CSV file into SQL server (so that I can run queries easily).

I tried several online pdf to csv converters and Java based pdf to CSV tutorials. Nothing worked. I have spent 6-8 hours today for this and failed. My csv files were messed up and I had lot of nulls in my DB when i imported the csv. I even tried searching for a DHS api which could give me this info but found none.

Can someone please help me to extract the colleges exactly like they are shown in the pdf file ?

PS: You can see all the colleges using this url also. BUT, you have to scroll manually to extract all the results. It will take too long and data will not be in format given in pdf file.

Jon 'links in bio' Ericson
  • 20,880
  • 12
  • 98
  • 148
stack1
  • 1,004
  • 2
  • 13
  • 28
  • I wish DHS had made an api for this as well, besides their regular api for college officials and other government related stuff. – stack1 Sep 21 '15 at 06:37
  • I am now using Zamzar to make a docx file and then convert it to csv or excel. I might have to do a lot of manual data fixing :( I am so tired now. – stack1 Sep 21 '15 at 07:05
  • 1
    I would suggest you to use something like http://scrapy.org/ to get the information from the weblink you posted and create your CSV file. Why have you tagged so many computer programming languages to your question? – ρss Sep 21 '15 at 07:44
  • @pss: I guess the OP wants a solution in all of those languages... Pity plain C is missing. Maybe he can add it to the list. – Jongware Sep 21 '15 at 13:00
  • Considering the fairly straight forward page content stream style, the data should be extractable using a not too complicated custom text extractor. I have not checked all pages, though... obviously... – mkl Sep 21 '15 at 13:37
  • the url given could be processed by a fairly simple web driver but if you need the F and M columns then this route would not be useful. – engineersmnky Sep 21 '15 at 17:39
  • Have you tried Acrobat's export to Excel? The problem with exporting from PDF is that PDF per se does not know about paragraphs etc. In PDF, you have text objects placed on the canvas, and that's it. Depending on how the PDF got created, there is additional information available, which can be helpful for exporting; Acrobat does make use of it, and they have (particularly in the newest version, a pretty good export logic. – Max Wyss Sep 21 '15 at 18:31
  • @Jongware - I can add only max 5 tags. Besides, I learned C a looong time ago and would not be able to modify a C answer. – stack1 Sep 21 '15 at 20:17
  • @engineersmnky - Okay. Please put your solution anyway. Something is better than nothing. Btw, why is it not possible to get F and M ? – stack1 Sep 21 '15 at 20:19
  • @mkl - I am a newbie programmer. So, I am not able to understand what you are saying. Can you please tell me more and maybe suggest a tutorial that I can try, preferably in Java or Ruby. – stack1 Sep 21 '15 at 20:20
  • 1
    @stack1 F and M are not displayed with the schools in the School Search. Posting an answer would have to be so implementation specific that I would end up solving the problem for you and would take quite a bit of testing on my part but the basic concept would be read each masonry item in the masonry grid and parse out the name address and Campus id. when you run out scroll to the bottom and process the next set and so on. This process will take a long time as there are 16,000+ schools to process plus the data load – engineersmnky Sep 21 '15 at 20:50
  • @MaxWyss - Ok. I am getting Adobe acrobat DC for desktop. Trial version. If this doesn't fix it, probably manual is the only way out. FYI - It takes a million years to download the adobe trial. – stack1 Sep 21 '15 at 21:11
  • 2
    Suggestion, don't even try a pure programmatic approach, PDF is just not made for that and the more time you spend with programming solutions that almost work but not quite - the more you will get it. Instead what you want to do is find a program that makes the manual work as easy as possible, like pdf2text and then correct the errors by hand. – bbozo Sep 28 '15 at 06:41
  • I agree with bbozo. See my answer for that. I think there is `pdf2txt` and `pdftotext`. The latter is better suited for your needs, imo. – colidyre Sep 28 '15 at 08:56

4 Answers4

10

As already claimed in a comment to the question,

Considering the fairly straight forward page content stream style, the data should be extractable using a not too complicated custom text extractor.

In detail:

The page content stream style

Regular table entry content is drawn entry by entry, each entry field by field in reading order. Thus, while going through the content stream we do not have to try and re-arrange the content to establish that order. This makes this task fairly easy.

So the main work will be to ignore non-entries, i.e. the header on the first page, the bars indicating where a new first letter starts, and the page numbers.

We do so by

  • ignoring graphics and non-black text which takes care of the header and the first letter bars;
  • not accepting entries not starting with data in the SCHOOL NAME column which takes care of the page numbers which only live in the CAMPUS NAME column.

(Other approaches also would have done, e.g. ignoring everything in a bottom page area to take care of the page numbers.)

Now we merely have to split the entries into their fields.

Again the document structure helps, as it is a very uniform document, the table columns have the identical position and dimensions on each page. So we merely have to dissect at fixed x values.

There is just one stumbling block: in some entries atomic text chunks contain content of different columns. E.g. sometimes the contents of the F and M columns are drawn as a single string like "YN" and the optical distance is introduced via character spacing.

So we have to process the text chunks character by character, not as a whole.

A sample implementation

I use Java and the PDF library iText (current version 5.5.7 development snapshot) here. This does not mean at all that it cannot be done a using different setup, this merely is the setup I'm most accustomed to.

As separator I use the tab character because other likely candidates also occur as part of the text and I did not want to have to cope with escaping them.

This is the custom RenderListener class introduced to cope with the content as explained above:

public class CertifiedSchoolListExtractionStrategy implements RenderListener
{
    public CertifiedSchoolListExtractionStrategy(Appendable data, Appendable nonData)
    {
        this.data = data;
        this.nonData = nonData;
    }

    //
    // RenderListener implementation
    //
    @Override
    public void beginTextBlock() { }

    @Override
    public void endTextBlock() { }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) { }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        try
        {
            Vector startPoint = renderInfo.getBaseline().getStartPoint();
            BaseColor fillColor = renderInfo.getFillColor();
            if (fillColor instanceof GrayColor && ((GrayColor)fillColor).getGray() == 0)
            {
                if (debug)
                    data.append(String.format("%4d\t%3.3f %3.3f\t%s\n", chunk, startPoint.get(I1), startPoint.get(I2), renderInfo.getText()));
                for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
                {
                    renderCharacter(info);
                }
            }
            else
            {
                if (debug)
                    nonData.append(String.format("%4d\t%3.3f %3.3f\t%s\n", chunk, startPoint.get(I1), startPoint.get(I2), renderInfo.getText()));
                if (currentField > -1)
                    finishEntry();
                entryBuilder.append(renderInfo.getText());
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        finally
        {
            chunk++;
        }
    }

    public void renderCharacter(TextRenderInfo renderInfo) throws IOException
    {
        Vector startPoint = renderInfo.getBaseline().getStartPoint();

        float x = startPoint.get(I1);

        if (currentField > -1)
        {
            if (isInCurrentField(x))
            {
                entryBuilder.append(renderInfo.getText());
                return;
            }
            if (isInNextField(x))
            {
                currentField++;
                entryBuilder.append('\t').append(renderInfo.getText());
                return;
            }
            finishEntry();
        }
        if (isInNextField(x))
        {
            finishEntry();
            currentField = 0;
        }
        entryBuilder.append(renderInfo.getText());
    }

    public void close() throws IOException
    {
        finishEntry();
    }

    boolean isInCurrentField(float x)
    {
        if (currentField == -1)
            return false;

        if (x < fieldstarts[currentField])
            return false;

        if (currentField == fieldstarts.length - 1)
            return true;

        return x <= fieldstarts[currentField + 1];
    }

    boolean isInNextField(float x)
    {
        if (currentField == fieldstarts.length - 1)
            return false;

        if (x < fieldstarts[currentField + 1])
            return false;

        if (currentField == fieldstarts.length - 2)
            return true;

        return x <= fieldstarts[currentField + 2];
    }

    void finishEntry() throws IOException
    {
        if (entryBuilder.length() > 0)
        {
            if (currentField == fieldstarts.length - 1)
            {
                data.append(entryBuilder).append('\n');
            }
            else
            {
                nonData.append(entryBuilder).append('\n');
            }

            entryBuilder.setLength(0);
        }
        currentField = -1;
    }

    //
    // hidden members
    //
    final Appendable data, nonData;
    boolean debug = false;

    int chunk = 0;
    int currentField = -1;
    StringBuilder entryBuilder = new StringBuilder();

    final int[] fieldstarts = {20, 254, 404, 415, 431, 508, 534};
}

(CertifiedSchoolListExtractionStrategy.java)

We can use it like this:

@Test
public void testCertifiedSchoolList_9_16_2015() throws IOException
{
    try (   Writer data = new OutputStreamWriter(new FileOutputStream(new File(RESULT_FOLDER, "data.txt")), "UTF-8");
            Writer nonData = new OutputStreamWriter(new FileOutputStream(new File(RESULT_FOLDER, "non-data.txt")), "UTF-8")    )
    {
        CertifiedSchoolListExtractionStrategy strategy = new CertifiedSchoolListExtractionStrategy(data, nonData);
        PdfReader reader = new PdfReader("certified-school-list-9-16-2015.pdf");

        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        for (int page = 1; page <= reader.getNumberOfPages(); page++)
            parser.processContent(page, strategy);
        strategy.close();
    }
}

(ExtractCertifiedSchoolList.java)

Now data.txt contains all the entries as tab-separated lines and non-data.txt everything ignored.

Behind the scenes

To understand what is happening here, one first has to know how page content in PDFs is organized and how (for the sample code given) iText operates on it.

Inside the PDF

PDF documents are structures built from a number of base object types, some primitive types (numbers, strings, ...) and some more complex ones (arrays or dictionaries of other objects or streams).

A page in a PDF document is represented by such a dictionary object containing entries defining some page properties (like page dimensions) and other entries referencing objects that define what is drawn on the page: the content streams.

Content streams essentially contain a sequence of operations, which may

  • select a color (for stroking or filling),
  • define a path (move to some point, line to some other point, curve to yet another one, ...),
  • stroke or fill such a path,
  • draw some bitmap image somewhere,
  • draw some text somewhere, or
  • do numerous other things.

For the question at hand we mostly are interested in the operations involved in drawing text. In contrast to word processors the operations are not take this long string of text and arrange it as a paragraph but instead more primitively move text position here, draw this short string here, move text position again, and draw another string there.

E.g. in the sample PDF the oeprations for drawing the table header and the first entry line are these:

/TT2 1 Tf

Select font TT2 at size 1.

9.72 0 0 9.72 20.16 687.36 Tm

Set the text matrix to move the text insertion coordinates to 20.16, 687.36 and scale everything following by a factor of 9.72.

0 g

Select the grayscale fill color black

0 Tc
0 Tw

Select additional character and word spacing to 0.

(SCHOOL)Tj

Draw "SCHOOL" here.

/TT1 1 Tf

Select font TT1.

3.4082 0 TD

Move text insertion point by 3.4082 in x direction.

<0003>Tj

Draw a space character (the current font uses a different encoding which uses 16 bit per character, not 8, and here is represented hexadecimally).

/TT2 1 Tf
.2261 0 TD
[(NAME)-17887.4(CAMPUS)]TJ

Select font, move text insertion point, and draw the string "NAME", then a gap of 17887.4 text units, then draw "CAMPUS".

/TT1 1 Tf
24.1809 0 TD
<0003>Tj
/TT2 1 Tf
.2261 0 TD
[(NAME)-8986.6(F)-923.7(M)-459.3(CITY)-6349.9(ST)-1390.2(CAMPUS)]TJ
/TT1 1 Tf
28.5147 0 TD
<0003>Tj
/TT2 1 Tf
.2261 0 TD
(ID)Tj

Draw the rest of the header line.

/TT4 1 Tf
-56.782 -1.3086 TD

Move left by 56.782 and down by 1.3086, i.e. to the start of the first entry line.

("I)Tj
/TT3 1 Tf
.6528 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Am")Tj
/TT3 1 Tf
1.7783 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(School)Tj
/TT3 1 Tf
2.6919 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Inc.)-16894.2("I)]TJ
/TT3 1 Tf
18.9997 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Am")Tj
/TT3 1 Tf
1.7783 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(School)Tj
/TT3 1 Tf
2.6919 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Inc.)-8239.9(Y)-1018.9(N)-576.7(Mount)]TJ
/TT3 1 Tf
15.189 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Shasta)-2423.3(CA)-2443.7(41789)]TJ

And draw the first entry line.

As you see and as I had mentioned above, the table content is drawn in reading order. Even multi line column entries come in the needed order, e.g. the campus name "A F International of Westlake Village":

[(Inc.)-7228.7(A)]TJ
/TT3 1 Tf
9.26 0 TD 
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(F)Tj
/TT3 1 Tf
.4595 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(International)Tj
/TT3 1 Tf
5.2886 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(of)Tj
/TT3 1 Tf
.8325 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Westlake)Tj
/TT3 1 Tf
3.7739 0 TD
<0003>Tj
/TT4 1 Tf
-11.8374 -1.3086 TD

Move down to the second line of the column.

(Village)Tj
15.4938 1.3086 TD

Move up again to the main line of the entry.

[(Y)-1018.9(N)-576.7(Westlake)]TJ 

So we can digest the text as it comes, no need for sorting (the content could be ordered in a completely different way).

But we also see that there are no obvious column start and end points. To associate the text with a column, therefore, we have to calculate the positions of each character and compare them with externally given column start positions.

Parsing supported by libraries

PDF libraries usually provide some mechanism to help parsing such content streams.

There are two basic architectures for this, a library may parse the content stream

  • as a whole and provide it as a big array of positioned text chunks or
  • or piecewise and forward individual positioned text chunks using a listener pattern.

The former variant at first seems easier to handle but may have big resource requirements (I have come across multi-MB content streams), while the second one seems a bit more difficult to handle but has smaller memory requirements.

The library I used (iText) follows the latter approach but your problem could also have been solved using a library following the former one.

RenderListener is the listener interface to implement here, the renderText methods retrieves the tindividual text chunks with positions.

In the implementation above (CertifiedSchoolListExtractionStrategy) the renderText method first checks the fill color associated with the chunk and only forwards black text for further processing in renderCharacter. That method (and some helpers) in turn checks the field the text is in (by hard coded position boundaries) and accordingly exports tab separated values. This logic would similarly have been implemented using other libraries, too.

mkl
  • 90,588
  • 15
  • 125
  • 265
  • How does it get the pdf file ? Here is the exception trace from Junit: http://pastebin.com/FdiALcyw . I get the same NPE when I give it the full path to the pdf file. – stack1 Sep 22 '15 at 18:59
  • If you want to run the sample, check out the whole project. The PDF is read as resource. – mkl Sep 22 '15 at 20:52
  • thanks. i have never used this resource thing before. please tell me which google words to use so that i can search for tutorials on the related concepts. – stack1 Sep 22 '15 at 21:04
  • 1
    First of all, you don't necessarily need to use this mechanism. Simply replace the `new PdfReader(resource)` by `new PdfReader("PATH_TO_YOUR_PDF")` to load from the file system. – mkl Sep 22 '15 at 21:12
  • 1
    Thank you very much indeed ! You are really good at what you do. This could be a good tutorial for pdf and java. Now, I'll use excel to import this as a tab delimited file, then convert to csv and finally load to SQL server. – stack1 Sep 22 '15 at 21:37
  • Btw, I see one strange symbol  frequently only when I open the output file in excel. Wondering if you could tell me what it is and why it happens. I'll search and replace it with excel. – stack1 Sep 22 '15 at 21:39
  • A suggestion - If you work for iText or have your own blog, then please consider making a step by step tutorial based on this question. – stack1 Sep 22 '15 at 21:40
  • *I see one strange symbol  frequently* - the output is UTF-8 encoded. Maybe one has to tell excel so. Or you may want to try a different encoding, cf. the `OutputStreamWriter` constructors. – mkl Sep 23 '15 at 00:39
  • *If you work for iText* - no, I just happen to like certain aspects of iText very much. – mkl Sep 23 '15 at 00:40
  • I hope the "Behind the scenes" edit answers some of your questions. – mkl Sep 28 '15 at 11:42
  • gosh ! I just realized that i gave the +50 bounty to someone else instead of you. How do I correct this ? – stack1 Nov 13 '15 at 19:51
  • I admit that i had been a bit surprised back then. ;) I've no idea, though, whether you can change where the bounty went, let alone how. – mkl Nov 13 '15 at 20:19
8

Another solution without much effort for writing code to get the pdf read: There is a linux tool with a great -layout flag as already mentioned in askubuntu. It's called pdftotext:

$ pdftotext -layout <input.pdf> <output.txt>

It worked very promising for your provided pdf file. Sure, it's not a complete solution for your problem. But all you have to do then is to clean the text-output. This could be less time-sufficient than other solutions.

Here is a sample:

$ head -30 test.txt
                                                                                                          Updated
                                     SEVP Certified Schools                                      September 16, 2015
SCHOOL NAME                                     CAMPUS NAME                            F M CITY                     ST   CAMPUS ID
"I Am" School Inc.                              "I Am" School Inc.                     Y N Mount Shasta             CA     41789
424 Aviation                                    424 Aviation                           N Y Miami                    FL     103705
                                                            ‐ A ‐
A F International School of Languages Inc.      A F International College              Y   N Los Angeles            CA      9538
A F International School of Languages Inc.      A F International of Westlake          Y   N Westlake Village       CA     57589
                                                Village
A. T. Still University of Health Sciences       Kirksville Coll of Osteopathic         Y   N Kirksville         MO         3606
                                                Medicine
Aaron School                                    Aaron School ‐ 30th Street             Y   N   New York             NY    159091
Aaron School                                    Aaron School                           Y   N   New York             NY    114558
ABC Beauty Academy, INC.                        ABC Beauty Academy, INC.               N   Y   Flushing             NY    95879
ABC Beauty Academy, LLC                         ABC Beauty Academy                     N   Y   Garland              TX    50677
Abcott Institute                                Abcott Institute                       N   Y   Southfield           MI    197890
Aberdeen Catholic School System                 Roncalli Primary                       Y   N   Aberdeen             SD    180510
Aberdeen Catholic School System                 Roncalli                               Y   N   Aberdeen             SD    21405
Aberdeen Catholic School System                 Roncalli Elementary                    Y   N   Aberdeen             SD    180511
Aberdeen School District 6‐1                    Aberdeen Central High School           Y   N   Aberdeen             SD    36568
Abiding Savior Lutheran School                  Abiding Savior Lutheran School         Y   N   Lake Forest          CA     9920
Abilene Christian Schools                       Abilene Christian Schools              Y   N   Abilene              TX     8973
Abilene Christian University                    Abilene Christian University           Y   N   Abilene              TX     7498
Abington Friends School                         Abington Friends School                Y   N   Jenkintown           PA    20191
Above It All, Inc                               Benchmark Flight /Hawaii Flight        N   Y   Kailua‐Kona          HI    24353
                                                Academy
Abraham Baldwin Agricultural College            Tifton Campus                          Y   N Tifton             GA         6931
Abraham Joshua Heschel School                   Abraham Joshua Heschel School          Y   N New York           NY        106824

ABT Jacqueline Kennedy Onassis School           ABT Jacqueline Kennedy Onassis         Y   Y New York               NY     52401

So this turns your problem in transforming that text output to a database readable csv file. Maybe you or another could prefer this way of doing it.

colidyre
  • 4,170
  • 12
  • 37
  • 53
  • 2
    As someone who has quite a bit of experience with PDF parsing, this is by far the best approach of all, short of hiring a bangladesh freelancer to do the data entry manually into Excel – bbozo Sep 28 '15 at 06:38
  • colidyre - I accidentally gave the bounty to you instead of the answer I selected. Can you please give the bounty to the correct answer ? Sorry for the confusion. – stack1 Nov 14 '15 at 23:12
  • 1
    Dear @stack1. I cannot see how to do that, unfortunately. Imho, my answer is the best one, because it's very simple and do not need any complicated implementations. So I had to assume that everything was ok. Nevertheless, if it was accidentally done, I have no problem to give the bounty away, but I can't simply see how to do that. As written on [help site](http://stackoverflow.com/help/bounty) bounty seems to be permanent. – colidyre Nov 15 '15 at 09:20
1

I once had a Ruby project that did this kind of work. I used the gem pdf/reader and evenutally it dit work but I advise against using this aproach, the contents of your PDF have no markers where the fields start and stop, instead you have to measure the position of each piece of text (and there are many pieces per field), here an example of the first field

"I
NUL ETX
Am"
NUL ETX
School
NUL ETX
Inc.

and compare it with boundaries you have to find by experimenting like "if the position is > 2.54cm from the left margin and < 5.78cm from the left margin" etc.. It is tedious and error prone.

The easiest solution is to somehow read the entire textcontents of your second url by manually scrolling, selecting and copying the contents into an editor and remove the head and tail extra's or use some web scraping gem like mechanize and then convert this text to CSV. The last part is easy since the structure is fixed

"I Am" School
118 Siskiyou Avenue
Mount Shasta , CA , 96067
5309266263  <--end of first record
424 Aviation
13230 SW 132 Ave.
Miami , FL , 33186
7862424848  <--end of second record

If you need help with this last part, no problem

If this is a once operation you could also use a tool like able2extract (if you'r on windows) it reads pdf and saves in Excel, the times I used it the result was decent and layout was intact.

peter
  • 41,770
  • 5
  • 64
  • 108
  • Note: comment deleted, because attached at the wrong place – Max Wyss Sep 21 '15 at 18:28
  • able2extract messes things. Besides, it has only 3 page conversion limit and the full thing costs $100. If you have the licensed version, can you please convert it to excel for me and share ? – stack1 Sep 21 '15 at 22:56
  • Can you please help me with the last part, ie using mechanize gem to extract data from url instead of pdf ? – stack1 Sep 21 '15 at 22:57
  • i'm at work, srr no time for that, but we have a licence, if you give me a mailaddress I'll send the conversion, which is perfect – peter Sep 22 '15 at 07:36
1

For this kind of PDF extraction, you can use IntelliGet (http://akribiatech.com/intelliget) . A simple script like below would serve your purpose here

userVariables = school, campus;
{ start = IsNumeric(Substring(Line(0),112,115)); 
  school = ""; campus = "";
  { start = 1;
    maxCount = 2;
    school = Concat(school, " ", Trim(Substring(Line(0),1,52)));
    campus = Concat(campus, " ", Trim(Substring(Line(0),53,82)));
  }
  output = Concat(Trim(school), "|", Trim(campus));
}