I used read_pdf from tabula to read a table in a PDF by specifying the area parameter. I want to retain the table structure as is, including the lines between the columns and rows (if applicable). I read that matplotlib can be used to do this, but when I write the extracted table to a CSV, the table structure vanishes and there are only spaces between the rows and columns. My code:

from tabula import read_pdf

path = "---"
# area is given as (top, left, bottom, right)
df = read_pdf(path, stream=True, encoding="utf-8", guess=False, nospreadsheet=True, area=(112.37, 35.34, 153.36, 212.43))
print(df)
df.to_csv("path to destination csv file")

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.xaxis.set_visible(False) 
ax.yaxis.set_visible(False)

ax.table(cellText=df.values, colLabels=df.columns, loc='center')
fig.tight_layout()
plt.show()

When I look at the content of the destination CSV file, the lines between columns are not retained. For example, for the PDF shown below, I wish to read the data from the table and put it into a CSV file while retaining the lines between columns, but my code doesn't retain the lines.

[1]

Instead, I want my code to produce a CSV file that separates the columns with drawn lines, like this:

required table in csv

The PDF included here is a sample. For my original PDF, matplotlib displays the following output: [image]. I want it to look like this instead (only the part inside the black lines, with the column divisions): [image]

developer
  • the [tag:tabula] does not apply here - you are not using a Java GUI to convert pdf. same for [tag:matplotlib] - @Scotty1 removed the [tag:csv] as well - no idea why - this is about a (misconceived) csv. – Patrick Artner Jan 04 '19 at 10:01
  • @PatrickArtner, I have a pdf that I converted into a text file. After that, I used the original pdf to locate the table, extract only the table content along with its structure, and store it in a csv. Matplotlib can be used to show such table structures in a neat, tabular way, right? I used these sources: https://stackoverflow.com/questions/45457054/tabula-extract-tables-by-area-coordinates, https://stackoverflow.com/questions/32137396/how-do-i-plot-only-a-table-in-matplotlib – developer Jan 04 '19 at 10:07
  • @PatrickArtner I also removed CSV, since the problem is, in my opinion, not directly related to csv. developer asks for a table style with lines between the columns (or as I'd recommend: a list of values in the second column to be able to work with the comma separated format), which is not specific to CSV, nor any other common table style. So I thought adding the tag `csv` because of one single line of code, which is not even the "core" of the question, was not justified. – JE_Muc Jan 04 '19 at 10:11
  • @Scotty1- sure. Understood – developer Jan 04 '19 at 10:13
  • @developer the tags guide SO users to your question. Ask yourself: what part of your question would interest a "matplotlib"-experienced user? You never show what you did with matplotlib, so there is no "code to improve" - hence matplotlib should not be tagged imho. You also used windows or linux as your system, but you would not want to tag windows or linux - it has nothing to do with the content of your question – Patrick Artner Jan 04 '19 at 10:20
  • @PatrickArtner, I have edited my code to show the matplotlib part where i tried to use it to generate a table from the dataframe i had. – developer Jan 04 '19 at 10:24
  • Matplotlib tables do have lines in between rows, so I'm a bit confused as to what the question is. Can you show a screenshot of the outcome of your code and use it to explain how it differs from the desired outcome? – ImportanceOfBeingErnest Jan 04 '19 at 10:27
  • @ImportanceOfBeingErnest , I have edited my question to show the output i am getting and the output i desire to get – developer Jan 04 '19 at 10:35
  • Neither dataframes, nor matplotlib tables know the concept of "column-spanning". This would need to be done manually. – ImportanceOfBeingErnest Jan 04 '19 at 10:41
  • @ImportanceOfBeingErnest Okay thank you – developer Jan 04 '19 at 10:45

2 Answers

CSV (character- or comma-separated values) is a format in which values are separated by one and the same separator character. CSV is text-based - there are no "lines" in it.
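
To make this concrete, here is a minimal sketch using Python's standard csv module. The sample rows are made up for illustration and are written to an in-memory buffer rather than a real file. Whatever gets written comes back as plain lists of strings; the format has no place to store borders.

```python
import csv
import io

# Hypothetical sample rows, for illustration only
rows = [["tata", "ta", "fo", "ka"], ["1234", "45", "79", "45"]]

# Write the rows to an in-memory text buffer instead of a real file
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the same lists of strings, nothing more
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed)
```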

There are unicode characters that can be used to "form lines":

    U+2500    ─   e2 94 80    ─ ─   BOX DRAWINGS LIGHT HORIZONTAL
    U+2501    ━   e2 94 81    ━ ━   BOX DRAWINGS HEAVY HORIZONTAL
    U+2502    │   e2 94 82    │ │   BOX DRAWINGS LIGHT VERTICAL
    U+2503    ┃   e2 94 83    ┃ ┃   BOX DRAWINGS HEAVY VERTICAL
... snip ...
    U+250C    ┌   e2 94 8c    ┌ ┌   BOX DRAWINGS LIGHT DOWN AND RIGHT
    U+250D    ┍   e2 94 8d    ┍ ┍   BOX DRAWINGS DOWN LIGHT AND RIGHT HEAVY
    U+250E    ┎   e2 94 8e    ┎ ┎   BOX DRAWINGS DOWN HEAVY AND RIGHT LIGHT
... snip ...
    U+2533    ┳   e2 94 b3    ┳ ┳   BOX DRAWINGS HEAVY DOWN AND HORIZONTAL
    U+2534    ┴   e2 94 b4    ┴ ┴   BOX DRAWINGS LIGHT UP AND HORIZONTAL
    U+2535    ┵   e2 94 b5    ┵ ┵   BOX DRAWINGS LEFT HEAVY AND RIGHT UP LIGHT
... snip ...
    U+2548    ╈   e2 95 88    ╈ ╈   BOX DRAWINGS UP LIGHT AND DOWN HORIZONTAL HEAVY
    U+2549    ╉   e2 95 89    ╉ ╉   BOX DRAWINGS RIGHT LIGHT AND LEFT VERTICAL HEAVY
    U+254A    ╊   e2 95 8a    ╊ ╊   BOX DRAWINGS LEFT LIGHT AND RIGHT VERTICAL HEAVY
    U+254B    ╋   e2 95 8b    ╋ ╋   BOX DRAWINGS HEAVY VERTICAL AND HORIZONTAL

source
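
If you would rather generate such a list than copy it, the following sketch looks up the official names with Python's standard unicodedata module. It assumes the range U+2500 to U+257F, which is the Unicode "Box Drawing" block; the variable name box_chars is made up for this example.

```python
import unicodedata

# Map every character of the Box Drawing block (U+2500..U+257F) to its name
box_chars = {
    chr(code_point): unicodedata.name(chr(code_point))
    for code_point in range(0x2500, 0x2580)
}

for char, name in box_chars.items():
    print(f"U+{ord(char):04X}  {char}  {name}")
```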

But you would have to "typeset" each line piece on its own to mimic a table in unicode:

┍━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃   tata         ┃   ta    ┃    fo   ┃  ka     ┃
┝━━━━━━━━━━━━━━━╈━━━━━━━━╈━━━━━━━━╈━━━━━━━━┧
┃   1234         ┃   45    ┃   79    ┃  45     ┃
┝━━━━━━━━━━━━━━━╈━━━━━━━━╈━━━━━━━━╈━━━━━━━━┧
┃   1234         ┃   45    ┃   79    ┃  45     ┃
┕━━━━━━━━━━━━━━━┻━━━━━━━━┻━━━━━━━━┻━━━━━━━━┛

However, this is not csv - it is more like human-readable ascii-art. The corresponding csv would be:

tata,ta,fo,ka
1234,45,79,45
1234,45,79,45

(if using , as the separator char - replace it with whichever char you like better: [" ","|",";",\t] )
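
As a rough sketch of swapping the separator, Python's csv.writer accepts a delimiter argument. The helper name to_delimited and the sample rows are made up for this example.

```python
import csv
import io

# Hypothetical sample rows, for illustration only
rows = [["tata", "ta", "fo", "ka"], ["1234", "45", "79", "45"]]

def to_delimited(rows, delimiter):
    """Return the rows as text, with fields separated by `delimiter`."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

# Same data, different separator characters
for sep in [",", ";", "|", "\t"]:
    print(repr(sep))
    print(to_delimited(rows, sep))
```

Even with "|" as the separator, the result is still just delimited text, not a drawn table.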


Disclaimer:

I am purposefully bad at ascii-art and chose not to match the corresponding unicode line weights (LIGHT, HEAVY) exactly to get my point across. This is on purpose - call me lazy.

Patrick Artner

Separate answer - this code formats a "well-defined" csv

tata,ta,for,kattatom
1234,45,79,45
1234,45,79,45

into a "utf8-art":

┌────┬──┬───┬────────┐
│tata│ta│for│kattatom│
├────┼──┼───┼────────┤
│1234│45│79 │45      │
├────┼──┼───┼────────┤
│1234│45│79 │45      │
└────┴──┴───┴────────┘

The utf8-art is appended to the file output.txt.


import csv 

def create_table(file_name):
    """Takes the file name of a csv and produces utf8-art of the data.
    Rows with fewer columns than the widest row are assumed to be
    missing columns at the end and are padded with empty cells."""
    # mostly untested code - works for the 2 examples mentioned here
    with open(file_name, "r", newline="") as f:
        reader = csv.reader(f) 
        w = get_widths(reader)
        row_count = w["last"] 
        del w["last"]
        f.seek(0)
        return create_table_string(reader, w, row_count)

def get_widths(csv_reader):
    widths = {}
    row_count = 0
    for row in csv_reader:
        if row: # ignore empties
            row_count += 1
            for idx,data in enumerate(row):
                widths[idx] = max(widths.get(idx,0),len(data))

    widths["last"] = row_count
    return widths

# supply other set of lines if you like
deco = dict(zip("hv012345678", "─│┌┬┐├┼┤└┴┘"))


def base_row(widths, row, max_key, _v, _h, _l, _m, _r):
    # border line: left corner, one horizontal run per column joined by
    # the middle junction, then the right corner (also works for 1 column)
    decoration = [_l + _h*widths[0]]
    for i in range(1, max_key+1):
        decoration.append(_m + _h*widths[i])
    decoration[-1] += _r

    text_data = []
    if row:
        # left-align each value inside its column
        for i, data in enumerate(row):
            text_data.append(_v + "{:<{}}".format(data, widths[i]))

        # pad rows that are shorter than the widest row with empty cells
        for empty in range(len(row), max_key+1):
            text_data.append(_v + " "*widths[empty])
        text_data[-1] += _v

    return [decoration, text_data]

def get_first_row(widths,row):
    return base_row(widths, row, max(widths.keys()), deco["v"], deco["h"], 
                    deco["0"], deco["1"], deco["2"])

def get_middle_row(widths,row):
    return base_row(widths, row, max(widths.keys()),  deco["v"], deco["h"],
                    deco["3"], deco["4"], deco["5"])

def get_last_row(widths):
    decoration, _ = base_row(widths, [], max(widths.keys()), deco["v"], 
                             deco["h"], deco["6"], deco["7"], deco["8"])

    return [decoration]


def create_table_string(reader, widths, row_count): 
    output = []
    r = 0 
    for row in reader:
        if row:
            r += 1
            if r==1:
                output.extend(get_first_row(widths, row))
            else:
                output.extend(get_middle_row(widths, row))

    output.extend( get_last_row(widths))
    return output

Usage:

#create sample csv
with open("data.csv","w") as f:
    f.write("""tata,ta,for,kattatom
1234,45,79,45
1234,45,79,45""")

# open outputfile for append
with open("output.txt", "a", encoding="UTF8") as output:
    output.write("\n" + "-" * 40 + "\n\n")

    # get utf8 art
    for line in create_table("data.csv"):
        output.write(''.join(line)+"\n")

Input csv:

tata,ta,for,kattatom
1234,45,79,45
1234,45,79,45

then, with two extra values in the second row:

tata,ta,for,kattatom
1234,45,79,45,8,0
1234,45,79,45

Output:

┌────┬──┬───┬────────┐
│tata│ta│for│kattatom│
├────┼──┼───┼────────┤
│1234│45│79 │45      │
├────┼──┼───┼────────┤
│1234│45│79 │45      │
└────┴──┴───┴────────┘

----------------------------------------

┌────┬──┬───┬────────┬─┬─┐
│tata│ta│for│kattatom│ │ │
├────┼──┼───┼────────┼─┼─┤
│1234│45│79 │45      │8│0│
├────┼──┼───┼────────┼─┼─┤
│1234│45│79 │45      │ │ │
└────┴──┴───┴────────┴─┴─┘
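
If the data is already in memory (for example rows taken from a DataFrame), the same kind of table can be built without going through a file. The following is a more compact sketch, not the code above: the function name box_table is made up, and it assumes a non-empty list of rows whose cells are already strings.

```python
def box_table(rows):
    """Return rows (a non-empty list of lists of strings) as a box-drawing table."""
    n_cols = max(len(row) for row in rows)
    # pad short rows so that every row has the same number of cells
    padded = [list(row) + [""] * (n_cols - len(row)) for row in rows]
    widths = [max(len(row[i]) for row in padded) for i in range(n_cols)]

    def border(left, mid, right):
        # one horizontal run per column, joined by the junction character
        return left + mid.join("─" * w for w in widths) + right

    def cells(row):
        # left-align every cell to the width of its column
        return "│" + "│".join(c.ljust(w) for c, w in zip(row, widths)) + "│"

    lines = [border("┌", "┬", "┐")]
    for i, row in enumerate(padded):
        if i:
            lines.append(border("├", "┼", "┤"))
        lines.append(cells(row))
    lines.append(border("└", "┴", "┘"))
    return "\n".join(lines)

table = box_table([["tata", "ta", "for", "kattatom"],
                   ["1234", "45", "79", "45"],
                   ["1234", "45", "79", "45"]])
print(table)
```

For a pandas DataFrame df, something like box_table([list(map(str, df.columns))] + df.astype(str).values.tolist()) should work, assuming every cell converts cleanly to a string.
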
Patrick Artner