Create Pandas DataFrame from a string

Question

In order to test some functionality I would like to create a DataFrame from a string. Let's say my test data looks like:

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

What is the simplest way to read that data into a Pandas DataFrame?

score 797 · Accepted Answer · edited Jun 17 '19 at 14:43

797

A simple way to do this is to use StringIO.StringIO (python2) or io.StringIO (python3) and pass that to the pandas.read_csv function. E.g:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

TESTDATA = StringIO("""col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140
    """)

df = pd.read_csv(TESTDATA, sep=";")

edited Jun 17 '19 at 14:43

Gustavo Lopes

3,794
4
17
57

answered Mar 24 '14 at 09:21

Emil L

20,219
3
44
65

4

FYI - `pd.read_table()` is an equivalent function, just slightly better nomenclature: `df = pd.read_table(TESTDATA, sep=";")`. – wkzhu Dec 06 '17 at 23:17
9

@AntonvBR Noted that one could use `pandas.compat.StringIO`. That way we don't have to import `StringIO` separately. However the `pandas.compat` package is considered private according to http://pandas.pydata.org/pandas-docs/stable/api.html?highlight=compat so leaving the answer as is for now. – Emil L Dec 12 '17 at 06:04
Time to sort out which import: [Should we use pandas.compat.StringIO or Python 2/3 StringIO?](https://stackoverflow.com/questions/50283292/should-we-use-pandas-compat-stringio-or-python-2-3-stringio) – smci May 11 '18 at 00:30
If you create TESTDATA with `df.to_csv(TESTDATA)`, use `TESTDATA.seek(0)` – user3226167 Oct 24 '19 at 08:30
I receive 'Error tokenizing data. C error: Expected 2 fields in line 26, saw 12\n',) – gdm Jul 31 '20 at 08:49
This is what I needed to do when uploading CSV files in colab. In order to make a dataframe out of the CSV, I used this method – Joshua Swain Jan 05 '21 at 13:12
@gdm The "Expected fields error" is usually due to inconsistent number of fields in each row, or wrong sep/delimiter. – skytaker Mar 26 '21 at 16:38
1

@Asclepius the suggestion to use `pandas.compat` only works for versions before pandas 0.25, but will now raise ImportError. – Mike T Dec 22 '21 at 00:11

Shaurya Uppal · Answer 2 · 2019-04-14T02:19:25.633

57

Split Method

data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)

edited Apr 14 '19 at 02:19

answered Feb 21 '19 at 17:27

Shaurya Uppal

3,410
31
31

8

If you want the first line to be used for column names, change the 2nd line to this: `df = pd.DataFrame([x.split(';') for x in data.split('\n')[1:]], columns=[x for x in data.split('\n')[0].split(';')])` – Mabyn Oct 18 '19 at 01:34
4

This is wrong, since on CSV files the newline (\n) character can be part of a field. – Antonio Ercole De Luca Apr 03 '20 at 13:13
4

This is not very robust, and most people would be better with the accepted answer. There is a very partial list of things that can go wrong with this at http://thomasburette.com/blog/2014/05/25/so-you-want-to-write-your-own-CSV-code/ – DanB May 15 '20 at 17:21
1

This is a weak and unnecessary approach. This circumvents the built-in `pd.read_csv()` call which handles any CSV and has been battle-tested in favor of a naive couple of string splits that don't handle quoting and are liable to fail in many cases. It's also more verbose. Please use the CSV parser Pandas kindly gives you! – ggorlen Mar 16 '23 at 04:59

score 54 · Answer 3 · edited Oct 20 '22 at 08:33

54

In one line, but first import io

import pandas as pd
import io   

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

df = pd.read_csv(io.StringIO(TESTDATA), sep=";")
print(df)

edited Oct 20 '22 at 08:33

mirekphd

4,799
3
38
59

answered Nov 26 '20 at 09:10

user3810512

587
4
3

12

What is the difference between this and the accepted answer? Except you move io operation to read_csv, which makes no difference... Please always check if similar answer is not posted already, redundancy is unnecessary. – Ruli Nov 26 '20 at 09:35
4

I like this answer because it is explained with one example in one line. – eduardosufan Jun 06 '22 at 18:19

score 26 · Answer 4 · answered Nov 30 '18 at 08:17

A quick and easy solution for interactive work is to copy-and-paste the text by loading the data from the clipboard.

Select the content of the string with your mouse:

In the Python shell use read_clipboard()

>>> pd.read_clipboard()
  col1;col2;col3
0       1;4.4;99
1      2;4.5;200
2       3;4.7;65
3      4;3.2;140

Use the appropriate separator:

>>> pd.read_clipboard(sep=';')
   col1  col2  col3
0     1   4.4    99
1     2   4.5   200
2     3   4.7    65
3     4   3.2   140

>>> df = pd.read_clipboard(sep=';') # save to dataframe

Not good for reproducibility, but otherwise a pretty neat solution! — Mabyn, Oct 18 '19 at 01:27

Asclepius · Answer 5 · 2020-06-24T22:02:42.070

This answer applies when a string is manually entered, not when it's read from somewhere.

A traditional variable-width CSV is unreadable for storing data as a string variable. Especially for use inside a .py file, consider fixed-width pipe-separated data instead. Various IDEs and editors may have a plugin to format pipe-separated text into a neat table.

Using `read_csv`

Store the following in a utility module, e.g. util/pandas.py. An example is included in the function's docstring.

import io
import re

import pandas as pd


def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
    """Read a Pandas object from a pipe-separated table contained within a string.

    Input example:
        | int_score | ext_score | eligible |
        |           | 701       | True     |
        | 221.3     | 0         | False    |
        |           | 576       | True     |
        | 300       | 600       | True     |

    The leading and trailing pipes are optional, but if one is present,
    so must be the other.

    `kwargs` are passed to `read_csv`. They must not include `sep`.

    In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can 
    be used to neatly format a table.

    Ref: https://stackoverflow.com/a/46471952/
    """

    substitutions = [
        ('^ *', ''),  # Remove leading spaces
        (' *$', ''),  # Remove trailing spaces
        (r' *\| *', '|'),  # Remove spaces between columns
    ]
    if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
        substitutions.extend([
            (r'^\|', ''),  # Remove redundant leading delimiter
            (r'\|$', ''),  # Remove redundant trailing delimiter
        ])
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)

Non-working alternatives

The code below doesn't work properly because it adds an empty column on both the left and right sides.

df = pd.read_csv(io.StringIO(df_str), sep=r'\s*\|\s*', engine='python')

As for read_fwf, it doesn't actually use so many of the optional kwargs that read_csv accepts and uses. As such, it shouldn't be used at all for pipe-separated data.

I found (by trial&error) that `read_fwf` takes more of `read_csv`s arguments than is documented, but it's true that [some have no effect](https://stackoverflow.com/q/59757478/974555). — gerrit, Jan 20 '20 at 16:12

Hunaphu · Answer 6 · 2023-07-01T14:48:46.483

4

Object: Take string make dataframe.

Solution

def str2frame(estr, sep = ',', lineterm = '\n', set_header = True):
    dat = [x.split(sep) for x in estr.strip(lineterm).split(lineterm)]
    df = pd.DataFrame(dat)
    if set_header:
        df = df.T.set_index(0, drop = True).T # flip, set ix, flip back
    return df

Example

estr = """
sym,date,strike,genus
APPLE,20MAY20,50.0,Malus
ORANGE,22JUL20,50.0,Rutaceae
"""

df = str2frame(estr)

print(df)
0     sym     date strike     genus
1   APPLE  20MAY20   50.0     Malus
2  ORANGE  22JUL20   50.0  Rutaceae

edited Jul 01 '23 at 14:48

answered Jun 18 '21 at 09:35

Hunaphu

589
10
11

Please use descriptive variable names and not estr, dat, cdf. – problemofficer - n.f. Monica Sep 07 '22 at 07:03
The style is intentional; feel free to copy this answer and use your own notation. – Hunaphu Nov 24 '22 at 16:24
[Same remark as this answer](https://stackoverflow.com/a/54812940/6243352). Please don't reinvent the wheel, just use `pd.read_csv()`. – ggorlen Mar 16 '23 at 05:01
If you have one, a better solution depending only on pandas would be appreciated. – Hunaphu Mar 18 '23 at 13:49

score 0 · Answer 7 · answered Nov 20 '22 at 08:51

0

Emample:

text = [ ['This is the NLP TASKS ARTICLE written by Anjum**'] ,['IN this article I”ll be explaining various DATA-CLEANING techniques '], ['So stay tuned for FURther More && '],['Nah I dont think he goes to usf ; he lives around']]
df = pd.DataFrame({'text':text})

Output

answered Nov 20 '22 at 08:51

Engr. Khuram Shahzad

380
3
13

This has nothing to do with the question. OP wants to read a CSV from a string, completely missing here. – ggorlen Mar 16 '23 at 05:02
@ggorlen. First see the question: "Create Pandas DataFrame from a string" and then see my answer. lol – Engr. Khuram Shahzad Apr 04 '23 at 19:53
Sorry, I did that and don't see what you mean. There's no CSV string in your answer. You've hardcoded a list, completely bypassing a conversion step that all of the other answers here address. The question has nothing to do with converting lists to dataframes. Even if you were going to hardcode lists, this is a poor solution because it adds extra single-element inner list wrappers on each of the strings for no apparent reason, making the `text` column an object rather than a string datatype. – ggorlen Apr 04 '23 at 19:54

Create Pandas DataFrame from a string

7 Answers7

Using `read_csv`

Non-working alternatives

Solution

Example

Linked

Related

Create Pandas DataFrame from a string

7 Answers7

Using read_csv

Non-working alternatives

Solution

Example

Linked

Related

Using `read_csv`