3

Often people ask questions on Stack Overflow with an output of print(dataframe). It is convenient if one has a way of quickly loading the dataframe data into a pandas.dataframe object.

What is/are the most suggestible ways of loading a dataframe from a dataframe-string (which may or may not be properly formatted)?

Example-1

If you want to load the following string as a dataframe what would you do?

# Dummy Data
s1 = """
Client NumberOfProducts ID
A      1                2
A      5                1
B      1                2
B      6                1
C      9                1
"""

Example-2

This type is more similar to what you find in csv file.

# Dummy Data
s2 = """
Client, NumberOfProducts, ID
 A, 1, 2
 A, 5, 1
 B, 1, 2
 B, 6, 1
 C, 9, 1
"""

Expected Output

enter image description here

References

Note: The following two links do not address the specific situation presented in Example-1. The reason I think my question is not a duplicate is that I think one cannot load the string in Example-1 using any of the solutions already posted on those links (at the time of writing).

  1. Create Pandas DataFrame from a string. Note that pd.read_csv(StringIO(s1), sep), as suggested here, doesn't really work for Example-1. You get the following output.
    enter image description here

  2. This question was marked as a duplicate of two Stack Overflow links. One of them is the one above, which fails in addressing the case presented in Example-1. And the second one is . Among all the answers presented there, only one looked like it might work for Example-1, but it did not work.

# could not read the clipboard and threw error
pd.read_clipboard(sep='\s\s+')

Error Thrown:

PyperclipException: 
    Pyperclip could not find a copy/paste mechanism for your system.
    For more information, please visit https://pyperclip.readthedocs.org
halfer
  • 19,824
  • 17
  • 99
  • 186
CypherX
  • 7,019
  • 3
  • 25
  • 37
  • @yatu I would prefer to use Method-2 like you mentioned, however, it fails to load the data properly from **Example-1**. Which is why I opened this question and left a reference to a similar question, but not the same in the content of the question. – CypherX Oct 25 '19 at 09:07
  • First was tested with `df = pd.read_clipboard(sep='\s+')` and working nice for me – jezrael Oct 25 '19 at 09:31
  • I tried `pd.read_clipboard(sep='\s+')` and got the same error as `pd.read_clipboard(sep='\s\s+')`. I think it is system-configuration specific. – CypherX Oct 25 '19 at 09:37
  • ya, I agree, it seems problem is with your clipbourd in your pc/nb. – jezrael Oct 25 '19 at 10:11
  • So, `pd.read_clipboard(sep='\s+')` is a sys-config specific solution. However, **Method-1** that I suggested, did not have any such limitation. In that regard, I think our discussion on this question and the answer should be left here so others could refer to it in future. But I don't know what the policy is of stackoverflow on questions marked duplicate. Do they get closed and removed later? Or, they just remain marked as duplicate, so when people visit them, they right away get to see other relevant questions and answers? – CypherX Oct 25 '19 at 10:20
  • I think if disagree, you can try meta stackovefrlow, ask new question anf if comunity decide it is not dupe, I have no problem reopen. – jezrael Oct 25 '19 at 10:23
  • What I am asking is what happens when you mark a question as duplicate? – CypherX Oct 25 '19 at 10:24
  • not sure if understand, it is only marked, not more, not less. – jezrael Oct 25 '19 at 10:25
  • 1
    Ok. My concern was if as _marked-a-duplicate_ this question will get deleted in future. I just checked meta, and it looks like that won't happen. SO, am okay as well. https://meta.stackoverflow.com/questions/320522/how-are-duplicate-questions-deleted – CypherX Oct 25 '19 at 10:30

1 Answers1

2

I can suggest two methods to approach this problem.

Method-1

Process the string with regex and numpy to make the dataframe. What I have seen is that this works most of the time. This would for the case presented in "Example-1".

# Make Dataframe
import pandas as pd
import numpy as np
import re

# Make Dataframe
# s = s1
ncols = 3 # number_of_columns
ss = re.sub('\s+',',',s.strip())
sa = np.array(ss.split(',')).reshape(-1,ncols)
df = pd.DataFrame(dict((k,v) for k,v in zip(sa[0,:], sa[1:,].T)))
df

Method-2

Use io.StringIO to feed into pandas.read_csv(). But this would work if the separator is well defined. For instance, if your data looks similar to "Example-2". Source credit

import pandas as pd
from io import StringIO

# Make Dataframe
# s = s2
df = pd.read_csv(StringIO(s), sep=',')

Output

enter image description here

CypherX
  • 7,019
  • 3
  • 25
  • 37
  • I think first solution is problematic, because get alwyas strings, second is good, unfortunately it is dupe. – jezrael Oct 25 '19 at 09:02
  • @jezrael But if you try using the second one for the data in **Example-1**, it does not work. So, this question sheds light on how to handle such scenarios. Is it still a duplicate? – CypherX Oct 25 '19 at 09:05
  • I think yes, it is dupe. Maybe is possible find better dupe too. – jezrael Oct 25 '19 at 09:06
  • Maybe more general solution is added to dupes. – jezrael Oct 25 '19 at 09:10
  • I totally support for not cluttering the question stack. But, the solutions that have been marked as this question's duplicate, **they do not address the issue I raised for Example-1**. Am I missing something? – CypherX Oct 25 '19 at 09:18
  • hmmm,, what separator was used? – jezrael Oct 25 '19 at 09:20
  • I used `sep=','` and also `sep='\t'`. – CypherX Oct 25 '19 at 09:32