Is there a better way to group data by a specified column with numpy without pandas?

Question

The solution in that post assumes the data is ordered by key, which is different from my case.

If I sort the data before applying that solution, the result is no more concise or efficient than what I've already achieved.

The dataset './melb_data.csv' comes from Kaggle.

This code draws a horizontal box plot.

import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

data = np.genfromtxt('melb_data.csv', 
                     delimiter=',', names = True, 
                     dtype=None, encoding=None)

tem1 = defaultdict(list)
for key, value in zip(data['Regionname'], data['Price']):
    tem1[key].append(value)

data = defaultdict(list)
for key, value in tem1.items():
    data["Regionname"].append(key)
    data["Price"].append(value)

fig, ax = plt.subplots()
ax.boxplot(data['Price'], labels=data['Regionname'], vert=False)
plt.show()

The code uses two for loops to group Price by Regionname. Is there a better way to do the grouping, for example with some numpy method?

I know it is easier to use pandas to do this, but for some reason, I have to do this without pandas.

[`itertools`](https://docs.python.org/3/library/itertools.html#module-itertools) has a `groupby` function and is part of the standard library. — SpghttCd, Oct 07 '19 at 22:39
@Valentino It seems that the solution in that post assumes the data is ordered by key, which is different from my case. If I sort the data first before applying that solution, it is no more concise or efficient than what I already have. Would you please take your tag back? — , Oct 07 '19 at 22:59
@ImportanceOfBeingErnest The solution in that post **assumes** the data is ordered by key, which is different from my case. If I sort the data before applying that solution, the result is no more concise or efficient than what I've already achieved. — , Oct 07 '19 at 23:01

score 0 · Accepted Answer · answered Oct 07 '19 at 22:33

You can do what you are looking for by using the set() constructor and numpy.where:

import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('melb_data.csv', 
                     delimiter=',', names=True, 
                     dtype=None, encoding=None)

processed_data          = {'Regionname': set(data['Regionname'])}
processed_data['Price'] = [data['Price'][np.where(data['Regionname'] == rn)]
                              for rn in processed_data['Regionname']]

fig, ax = plt.subplots()
ax.boxplot(processed_data['Price'], labels=processed_data['Regionname'],vert=False)
plt.show()

The set() constructor returns the unique values of Regionname. numpy.where returns the indices at which Regionname matches a given value. Note that the list comprehension for processed_data['Price'] compares against the original data['Regionname'] column (with duplicates) inside numpy.where, but iterates over the deduplicated set. This is because the indices must refer to rows of the original data['Price']. The same set object is iterated for both the prices and the labels, so the two stay in matching order.
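
If scanning the whole column once per region is a concern, a possible alternative is to sort once and split. The following is only a minimal sketch of that idea, using a stable np.argsort together with np.unique and np.split. The helper name group_by_key and the toy arrays are made up for illustration; they stand in for data['Regionname'] and data['Price'].

```python
import numpy as np

def group_by_key(keys, values):
    """Group `values` by `keys`; returns (unique_keys, list_of_value_arrays).

    Hypothetical helper for illustration, not part of numpy.
    """
    keys = np.asarray(keys)
    values = np.asarray(values)
    # Stable sort: values keep their original relative order within a group.
    order = np.argsort(keys, kind="stable")
    sorted_keys = keys[order]
    # In the sorted key array, find where each distinct key first appears.
    unique_keys, first_idx = np.unique(sorted_keys, return_index=True)
    # Cut the reordered values at those boundaries (skip the leading 0).
    groups = np.split(values[order], first_idx[1:])
    return unique_keys, groups

# Toy data (made up) standing in for the Regionname and Price columns.
regions = np.array(["North", "South", "North", "East", "South", "North"])
prices = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

names, grouped = group_by_key(regions, prices)
for name, group in zip(names, grouped):
    print(name, group)
```

The sort happens inside the helper, so the input does not need to be ordered by key. With the real data, grouped and names could be passed to ax.boxplot in place of processed_data['Price'] and processed_data['Regionname'].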

Enjoy!
