5

I have a non-standard CSV file that looks something like this:

x,y
1,"(5, 27, 4)"
2,"(3, 1, 6, 2)"
3,"(4, 5)"

Using pd.read_csv() leads to something that's not all that useful, because the tuples are not parsed. There are a existing answers that address this (1, 2), but because these tuples have heterogeneous lengths, those answers aren't entirely useful for the problem I'm having.

What I'd like to do is plot x vs y using the pandas plotting routines. The naive approach leads to an error because the tuples are stored as strings:

>>> # df = pd.read_csv('data.csv')
>>> df = pd.DataFrame({'x': [1, 2, 3],
                       'y': ["(5, 27, 4)","(3, 1, 6, 2)","(4, 5)"]})
>>> df.plot.scatter('x', 'y')
[...]
ValueError: scatter requires y column to be numeric

The result I'd hope for is something like this:

import numpy as np
import matplotlib.pyplot as plt
for x, y in zip(df['x'], df['y']):
    y = eval(y)
    plt.scatter(x * np.ones_like(y), y, color='blue')

enter image description here

Is there a straightforward way to create this plot directly from Pandas, by transforming the dataframe and using df.plot.scatter() (and preferably without using eval())?

jakevdp
  • 77,104
  • 11
  • 125
  • 160
  • If you're worried about the safety of eval, you could use something like this https://stackoverflow.com/questions/9763116/parse-a-tuple-from-a-string or this https://stackoverflow.com/a/19334399/8131703 – LangeHaare Oct 16 '17 at 17:31

2 Answers2

2

You could explode the df and plot

In [3129]: s = df.y.map(ast.literal_eval)

In [3130]: dff = pd.DataFrame({'x': df.x.repeat(s.str.len()).values,
                               'y': np.concatenate(s.values)})

In [3131]: dff
Out[3131]:
   x   y
0  1   5
1  1  27
2  1   4
3  2   3
4  2   1
5  2   6
6  2   2
7  3   4
8  3   5

And, plot

dff.plot.scatter('x', 'y')
Zero
  • 74,117
  • 18
  • 147
  • 154
1

You can use the .str accessor to extract integers, specifically .str.extractall:

# Index by 'x' to retain its values once we extract from 'y'
df = df.set_index('x')

# Extract integers from 'y'
df = df['y'].str.extractall(r'(\d+)')[0].astype('int64')

# Rename and reset the index (remove 'match' level, get 'x' as column)
df = df.rename('y').reset_index(level='match', drop=True).reset_index()

If you have floats instead of ints, just modify the regex and astype as appropriate.

This gives a DataFrame that looks like:

   x   y
0  1   5
1  1  27
2  1   4
3  2   3
4  2   1
5  2   6
6  2   2
7  3   4
8  3   5

And from there df.plot.scatter('x', 'y') should produce the expected plot.

root
  • 32,715
  • 6
  • 74
  • 87