53

I have a Pandas Dataframe generated from a database, which has data with mixed encodings. For example:

+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+
| ID | path                    | language | date       | longest_sentence                               | shortest_sentence                                      | number_words | readability_consensus |
+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+
| 0  | data/Eng/Sagitarius.txt | Eng      | 2015-09-17 | With administrative experience in the prepa... | I am able to relocate internationally on short not...  | 306          | 11th and 12th grade   |
+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+
| 31 | data/Nor/Høylandet.txt  | Nor      | 2015-07-22 | Høgskolen i Østfold er et eksempel...          | Som skuespiller har jeg både...                        | 253          | 15th and 16th grade   |
+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+

As seen there is a mix of English and Norwegian (encoded as ISO-8859-1 in the database I think). I need to get the contents of this Dataframe output as a Markdown table, but without getting problems with encoding. I followed this answer (from the question Generate Markdown tables?) and got the following:

import sys, sqlite3

db = sqlite3.connect("Applications.db")
df = pd.read_sql_query("SELECT path, language, date, longest_sentence, shortest_sentence, number_words, readability_consensus FROM applications ORDER BY date(date) DESC", db)
db.close()

rows = []
for index, row in df.iterrows():
    items = (row['date'], 
             row['path'], 
             row['language'], 
             row['shortest_sentence'],
             row['longest_sentence'], 
             row['number_words'], 
             row['readability_consensus'])
    rows.append(items)

headings = ['Date', 
            'Path', 
            'Language',
            'Shortest Sentence', 
            'Longest Sentence since', 
            'Words',
            'Grade level']

fields = [0, 1, 2, 3, 4, 5, 6]
align = [('^', '<'), ('^', '^'), ('^', '<'), ('^', '^'), ('^', '>'),
         ('^','^'), ('^','^')]

table(sys.stdout, rows, fields, headings, align)

However, this yields an UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 72: ordinal not in range(128) error. How can I output the Dataframe as a Markdown table? That is, for the purpose of storing this code in a file for use in writing a Markdown document. I need the output to look like this:

| ID | path                    | language | date       | longest_sentence                               | shortest_sentence                                      | number_words | readability_consensus |
|----|-------------------------|----------|------------|------------------------------------------------|--------------------------------------------------------|--------------|-----------------------|
| 0  | data/Eng/Sagitarius.txt | Eng      | 2015-09-17 | With administrative experience in the prepa... | I am able to relocate internationally on short not...  | 306          | 11th and 12th grade   |
| 31 | data/Nor/Høylandet.txt  | Nor      | 2015-07-22 | Høgskolen i Østfold er et eksempel...          | Som skuespiller har jeg både...                        | 253          | 15th and 16th grade   |
Community
  • 1
  • 1
OleVik
  • 1,222
  • 3
  • 14
  • 28
  • Try this: `new_string = string.decode('latin1')`, where string = "Høgskolen i Østfold er et eksempel..." In other words use `decode('latin1')` on your string to decode Norwegian characters. – Rohit Oct 17 '15 at 06:16
  • I set up a switch structure for decoding when ``language == 'Nor'``: enc = 'latin1' items = (row['date'].decode(enc), row['path'].decode(enc), row['language'].decode(enc), row['shortest_sentence'].decode(enc), row['longest_sentence'].decode(enc) ... But still get: ``UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 72: ordinal not in range(128)``, specifically on ``row['longest_sentence'].decode(enc)``. – OleVik Oct 17 '15 at 12:08

13 Answers13

44

Pandas 1.0 was released 29 january 2020 and supports markdown conversion, so you can now do this directly!

Example taken from the docs:

df = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=['a', 'a', 'b'])
print(df.to_markdown())
|    |   A |   B |
|:---|----:|----:|
| a  |   1 |   1 |
| a  |   2 |   2 |
| b  |   3 |   3 |

Or without the index:

print(df.to_markdown(index=False)) # use 'showindex' for pandas < 1.1
|   A |   B |
|----:|----:|
|   1 |   1 |
|   2 |   2 |
|   3 |   3 |
timvink
  • 636
  • 6
  • 7
  • Any way to remove the index when converting to markdown? Passing `index=False` didn't work. – stefanbschneider Mar 13 '20 at 15:29
  • 6
    Yes: `df.to_markdown(showindex=False)`. Pandas uses [tabulate](https://github.com/astanin/python-tabulate), so you can pass tabulate parameters via to_markdown(). – timvink Mar 15 '20 at 13:32
  • FutureWarning: `showindex` is deprecated. Only 'index' will be used in a future version. – Carson Aug 13 '20 at 02:53
  • Thanks, indeed futurewarning shows up with `pandas>=1.1`. Updated answer. – timvink Aug 14 '20 at 11:40
40

Improving the answer further, for use in IPython Notebook:

def pandas_df_to_markdown_table(df):
    from IPython.display import Markdown, display
    fmt = ['---' for i in range(len(df.columns))]
    df_fmt = pd.DataFrame([fmt], columns=df.columns)
    df_formatted = pd.concat([df_fmt, df])
    display(Markdown(df_formatted.to_csv(sep="|", index=False)))

pandas_df_to_markdown_table(infodf)

Or use tabulate:

pip install tabulate

Examples of use are in the documentation.

Update

As of pandas 1.0 DataFrame to markdown is available. Please see answer from @timvink (docs)

ap14
  • 4,393
  • 1
  • 15
  • 30
kpykc
  • 663
  • 6
  • 8
39

I recommend python-tabulate library for generating ascii-tables. The library supports pandas.DataFrame as well.

Here is how to use it:

from pandas import DataFrame
from tabulate import tabulate

df = DataFrame({
    "weekday": ["monday", "thursday", "wednesday"],
    "temperature": [20, 30, 25],
    "precipitation": [100, 200, 150],
}).set_index("weekday")

print(tabulate(df, tablefmt="pipe", headers="keys"))

Output:

| weekday   |   temperature |   precipitation |
|:----------|--------------:|----------------:|
| monday    |            20 |             100 |
| thursday  |            30 |             200 |
| wednesday |            25 |             150 |
slhck
  • 36,575
  • 28
  • 148
  • 201
user4157482
  • 809
  • 11
  • 18
11

Try this out. I got it to work.

See the screenshot of my markdown file converted to HTML at the end of this answer.

import pandas as pd

# You don't need these two lines
# as you already have your DataFrame in memory
df = pd.read_csv("nor.txt", sep="|")
df.drop(df.columns[-1], axis=1)

# Get column names
cols = df.columns

# Create a new DataFrame with just the markdown
# strings
df2 = pd.DataFrame([['---',]*len(cols)], columns=cols)

#Create a new concatenated DataFrame
df3 = pd.concat([df2, df])

#Save as markdown
df3.to_csv("nor.md", sep="|", index=False)

My output in HTML format by converting HTML to Markdown

Rohit
  • 5,840
  • 13
  • 42
  • 65
  • Relative to my dataframe there were a couple extra separators, but running this: ``cols = df.columns df2 = pd.DataFrame([['---','---','---','---','---','---','---']], columns=cols) df3 = pd.concat([df2, df]) df3.to_csv("nor.md", sep="|", index=False)`` Still gives a encoding error: ``UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 72: ordinal not in range(128)``. Also, out of curiosity, would it be problematic that ``shortest_sentence`` and ``longest_sentence`` may contain line breaks? – OleVik Oct 17 '15 at 14:58
  • I do not think the line breaks should matter as long as they show up fine in pandas DataFrame. Pandas should be able to take in unicode characters just fine so I am not sure why this does not work. Check out this question: (http://stackoverflow.com/questions/7315629/python-encoding-string-swedish-letters). I do not know how close Swedish is to Norwegian but you may find something helpful. – Rohit Oct 17 '15 at 15:30
  • If you find an answer to your problem, do answer your own question for posterity. – Rohit Oct 17 '15 at 15:35
  • Through some further testing I was able to, and using the elegant solution you proposed. Thanks for your help! – OleVik Oct 17 '15 at 17:54
7

Export a DataFrame to markdown

I created the following function for exporting a pandas.DataFrame to markdown in Python:

def df_to_markdown(df, float_format='%.2g'):
    """
    Export a pandas.DataFrame to markdown-formatted text.
    DataFrame should not contain any `|` characters.
    """
    from os import linesep
    return linesep.join([
        '|'.join(df.columns),
        '|'.join(4 * '-' for i in df.columns),
        df.to_csv(sep='|', index=False, header=False, float_format=float_format)
    ]).replace('|', ' | ')

This function may not automatically fix the encoding issues of the OP, but that is a different issue than converting from pandas to markdown.

Daniel Himmelstein
  • 1,759
  • 1
  • 21
  • 26
6

I have tried several of the above solutions in this post and found this worked most consistently.

To convert a pandas data frame to a markdown table I suggest using pytablewriter. Using the data provided in this post:

import pandas as pd
import pytablewriter
from StringIO import StringIO

c = StringIO("""ID, path,language, date,longest_sentence, shortest_sentence, number_words , readability_consensus 
0, data/Eng/Sagitarius.txt , Eng, 2015-09-17 , With administrative experience in the prepa... , I am able to relocate internationally on short not..., 306, 11th and 12th grade
31 , data/Nor/Høylandet.txt  , Nor, 2015-07-22 , Høgskolen i Østfold er et eksempel..., Som skuespiller har jeg både..., 253, 15th and 16th grade
""")
df = pd.read_csv(c,sep=',',index_col=['ID'])

writer = pytablewriter.MarkdownTableWriter()
writer.table_name = "example_table"
writer.header_list = list(df.columns.values)
writer.value_matrix = df.values.tolist()
writer.write_table()

This results in:

# example_table
ID |           path           |language|    date    |                longest_sentence                |                   shortest_sentence                  | number_words | readability_consensus 
--:|--------------------------|--------|------------|------------------------------------------------|------------------------------------------------------|-------------:|-----------------------
  0| data/Eng/Sagitarius.txt  | Eng    | 2015-09-17 | With administrative experience in the prepa... | I am able to relocate internationally on short not...|           306| 11th and 12th grade   
 31| data/Nor/Høylandet.txt  | Nor    | 2015-07-22 | Høgskolen i Østfold er et eksempel...        | Som skuespiller har jeg både...                      |           253| 15th and 16th grade   

Here is a markdown rendered screenshot.

enter image description here

dubbbdan
  • 2,650
  • 1
  • 25
  • 43
2

For those looking for how to do this using tabulate, I thought I'd put this here to save you some time:

print(tabulate(df, tablefmt="pipe", headers="keys", showindex=False))

Anake
  • 7,201
  • 12
  • 45
  • 59
2

Yet another solution. This time via thin wrapper around tabulate: tabulatehelper

import numpy as np
import pandas as pd
import tabulatehelper as th

df = pd.DataFrame(np.random.random(16).reshape(4, 4), columns=('a', 'b', 'c', 'd'))
print(th.md_table(df, formats={-1: 'c'}))

Output:

|        a |        b |        c |        d |
|---------:|---------:|---------:|:--------:|
| 0.413284 | 0.932373 | 0.277797 | 0.646333 |
| 0.552731 | 0.381826 | 0.141727 | 0.2483   |
| 0.779889 | 0.012458 | 0.308352 | 0.650859 |
| 0.301109 | 0.982111 | 0.994024 | 0.43551  |
2

Pandas have merged a PR to support df.to_markdown() method. You can find more details here It should be available soon.

sam
  • 2,263
  • 22
  • 34
1

Right, so I've taken a leaf from a question suggested by Rohit (Python - Encoding string - Swedish Letters), extended his answer, and came up with the following:

# Enforce UTF-8 encoding
import sys
stdin, stdout = sys.stdin, sys.stdout
reload(sys)
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('UTF-8')

# SQLite3 database
import sqlite3
# Pandas: Data structures and data analysis tools
import pandas as pd

# Read database, attach as Pandas dataframe
db = sqlite3.connect("Applications.db")
df = pd.read_sql_query("SELECT path, language, date, shortest_sentence, longest_sentence, number_words, readability_consensus FROM applications ORDER BY date(date) DESC", db)
db.close()
df.columns = ['Path', 'Language', 'Date', 'Shortest Sentence', 'Longest Sentence', 'Words', 'Readability Consensus']

# Parse Dataframe and apply Markdown, then save as 'table.md'
cols = df.columns
df2 = pd.DataFrame([['---','---','---','---','---','---','---']], columns=cols)
df3 = pd.concat([df2, df])
df3.to_csv("table.md", sep="|", index=False)

An important precursor to this is that the shortest_sentence and longest_sentence columns do not contain unnecessary line breaks, as removed by applying .replace('\n', ' ').replace('\r', '') to them before submitting into the SQLite database. It appears that the solution is not to enforce the language-specific encoding (ISO-8859-1 for Norwegian), but rather that UTF-8 is used instead of the default ASCII.

I ran this through my IPython notebook (Python 2.7.10) and got a table like the following (fixed spacing for appearance here):

| Path                    | Language | Date       | Shortest Sentence                                                                            | Longest Sentence                                                                                                                                                                                                                                         | Words | Readability Consensus |
|-------------------------|----------|------------|----------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|-----------------------|
| data/Eng/Something1.txt | Eng      | 2015-09-17 | I am able to relocate to London on short notice.                                             | With my administrative experience in the preparation of the structure and content of seminars in various courses, and critiquing academic papers on various levels, I am confident that I can execute the work required as an editorial assistant.       | 306   | 11th and 12th grade   |
| data/Nor/NoeNorrønt.txt | Nor      | 2015-09-17 | Jeg har grundig kjennskap til Microsoft Office og Adobe.                                     | I løpet av studiene har jeg vært salgsmedarbeider for et større konsern, hvor jeg solgte forsikring til studentene og de faglige ansatte ved universitetet i Trønderlag, samt renholdsarbeider i et annet, hvor jeg i en periode var avdelingsansvarlig. | 205   | 18th and 19th grade   |
| data/Nor/Ørret.txt.txt  | Nor      | 2015-09-17 | Jeg håper på positiv tilbakemelding, og møter naturligvis til intervju hvis det er ønskelig. | I løpet av studiene har jeg vært salgsmedarbeider for et større konsern, hvor jeg solgte forsikring til studentene og de faglige ansatte ved universitetet i Trønderlag, samt renholdsarbeider i et annet, hvor jeg i en periode var avdelingsansvarlig. | 160   | 18th and 19th grade   |

Thus, a Markdown table without problems with encoding.

Community
  • 1
  • 1
OleVik
  • 1,222
  • 3
  • 14
  • 28
  • Calling `sys.setdefaultencoding('UTF-8')` is an ultra bad idea. It masks all kind of issues and it looks like your new code doesn't need it as you're not calling `table()`, which is full of implied encodings. – Alastair McCormack Dec 26 '15 at 12:18
1

Here's an example function using pytablewriter and some regular expressions to make the markdown table more similar to how a dataframe looks on Jupyter (with the row headers bold).

import io
import re
import pandas as pd
import pytablewriter

def df_to_markdown(df):
    """
    Converts Pandas DataFrame to markdown table,
    making the index bold (as in Jupyter) unless it's a
    pd.RangeIndex, in which case the index is completely dropped.
    Returns a string containing markdown table.
    """
    isRangeIndex = isinstance(df.index, pd.RangeIndex)
    if not isRangeIndex:
        df = df.reset_index()
    writer = pytablewriter.MarkdownTableWriter()
    writer.stream = io.StringIO()
    writer.header_list = df.columns
    writer.value_matrix = df.values
    writer.write_table()
    writer.stream.seek(0)
    table = writer.stream.readlines()

    if isRangeIndex:
        return ''.join(table)
    else:
        # Make the indexes bold
        new_table = table[:2]
        for line in table[2:]:
            new_table.append(re.sub('^(.*?)\|', r'**\1**|', line))    

        return ''.join(new_table)
Gustavo Bezerra
  • 9,984
  • 4
  • 40
  • 48
1

Using external tool pandoc and pipe:

def to_markdown(df):
    from subprocess import Popen, PIPE
    s = df.to_latex()
    p = Popen('pandoc -f latex -t markdown',
              stdin=PIPE, stdout=PIPE, shell=True)
    stdoutdata, _ = p.communicate(input=s.encode("utf-8"))
    return stdoutdata.decode("utf-8")
Ilya Prokin
  • 684
  • 6
  • 11
0

sqlite3 returns Unicodes by default for TEXT fields. Everything was set up to work before you introduced the table() function from an external source (that you did not provide in your question).

The table() function has str() calls which do not provide an encoding, so ASCII is used to protect you.

You need to re-write table() not to do this, especially as you've got Unicode objects. You may have some success by simply replacing str() with unicode()

Alastair McCormack
  • 26,573
  • 8
  • 77
  • 100