Create a DataFrame from a set of ranges with mixed numerical types

Question

I have a list of 5 ranges from which I want to create a DataFrame. The resulting DataFrame should have 10 rows and 5 columns. Each column should hold random numbers drawn from the corresponding range.

The given ranges are a mix of integers and floats, i.e. [1,31] represents a range of integers, [4, 172.583333] represents a range of floats.

The code below works for outputs of either integers or floats only.

How can I have an output of a mix of integers and floats together? I.e. column A holds integers, column B holds floats, column C also holds floats, D and E hold integers.

Thank you.

import numpy as np
import pandas as pd

min_max = [
[1, 31],
[4, 172.583333],
[0, 88.50561],
[4, 297],
[3, 37]]

for a, b in min_max:
    df = pd.DataFrame(np.random.randint(a,b,size=(10, 5)), columns=list('ABCDE'))   # to generate integers only
    df = pd.DataFrame(np.random.uniform(a,b,size=(10, 5)), columns=list('ABCDE'))   # to generate floats only

What’s the exact problem here? You know how to create the ranges, and how to create the DataFrame, right? How did you end up in this situation? Having to check the types is probably poor design. — AMC, Dec 08 '19 at 00:16

Oliver W. · Accepted Answer · 2019-12-08T00:08:45.333

Create a separate pd.Series based on the datatype that you want. In the example below, this is inferred by checking whether the minimum or maximum is of the float type. There are other ways to do that, like explicitly adding the datatype you want.

Then, with the list of Series, create a DataFrame.

import numpy as np
import pandas as pd

min_max = ([1, 31], [4, 172.583333], [0, 88.50561], [4, 297], [3, 37])


def make_series(low, high, name):
    if any(isinstance(_, float) for _ in (low, high)):
        func = np.random.uniform
    else:
        func = np.random.randint
    return pd.Series(func(low, high, size=(10,)), name=name)


pd.concat([make_series(lo, hi, name) for (lo, hi), name in zip(min_max, "ABCDE")],
          axis=1)
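
The alternative mentioned above, stating the datatype explicitly instead of inferring it from the bounds, could look roughly like the sketch below. The `specs` mapping and the `make_column` helper are hypothetical names that are not part of the original answer, and the seeded `np.random.default_rng` generator is an assumption made only so the output is reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: (low, high, dtype) per column, so the type is stated
# explicitly rather than inferred from the bounds.
specs = {
    "A": (1, 31, int),
    "B": (4, 172.583333, float),
    "C": (0, 88.50561, float),
    "D": (4, 297, int),
    "E": (3, 37, int),
}

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible


def make_column(low, high, dtype, size=10):
    """Draw `size` values from [low, high) with the requested dtype."""
    if dtype is float:
        return rng.uniform(low, high, size=size)
    return rng.integers(low, high, size=size)  # upper bound is excluded


df = pd.DataFrame({name: make_column(*spec) for name, spec in specs.items()})
print(df.dtypes)
```

With this layout a column such as `(4, 297, float)` can hold floats even though both bounds are integers, which the `isinstance` check cannot express.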

edited Dec 08 '19 at 00:08

answered Dec 07 '19 at 23:51


W, superb! thank you for the sharing of knowledge and perfect solution! – Mark K Dec 08 '19 at 00:05
While assigning the functions to a variable is a neat trick, what’s the benefit here? I see only downsides. – AMC Dec 08 '19 at 00:21
@AlexanderCécile reduced typing. If you like copy-pasta, go ahead and spell it out. However, be aware that _everything_ in Python is an object and passing around functions is commonly done. In my experience, I see only benefits. Feel free to spell out the downsides. – Oliver W. Dec 08 '19 at 00:23
I agree that it does save some typing, that's true. I must admit I'm quite surprised that someone with 10k rep considers "reducing typing" to be of some importance. I'm doubly surprised that you wouldn't even consider the possibility that this might make the code more difficult to understand. Speaking of difficult to parse, what is going on with `any(isinstance(_, float) for _ in (low, high))`? Is that a variable named `_`, being used in the expression, the **exact opposite** of the conventional purpose of the underscore? – AMC Dec 08 '19 at 02:25
(continued) Is that to save on typing, or is it to reduce memory consumption by using shorter names? Both? Are the generator expression, tuple construction, and `any()` call also to reduce the amount of typing? The meaning is as obvious as `isinstance(low, float) or isinstance(high, float)`, right? Is this the Python equivalent of the [enterprise java meme](https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpriseEdition)? – AMC Dec 08 '19 at 02:36
Jokes aside though I have nothing against higher-order functions, I'm a fan of functional programming. Except that this particular case has very little to do with that, no? It's literally just to save 17 whole characters. I take particular offence with this: _If you like copy-pasta, go ahead and spell it out._ It might be because I'm just a n00b, but I use an IDE which offers code completion... – AMC Dec 08 '19 at 02:43
Oh my god, I just noticed the `min_max = ([1, 31], [4, 172.583333], [0, 88.50561], [4, 297], [3, 37])`. I feel ridiculous, my apologies. I should have realized that one of us is drunk, high, or both. That certainly explains why it looks like lists are being used to hold what we know will always be two values, and that a tuple is being used as a collection for a variable number of elements. – AMC Dec 08 '19 at 02:48
Alexander, you’re right about the underscore. It’s [commonly used](https://stackoverflow.com/questions/5893163/what-is-the-purpose-of-the-single-underscore-variable-in-python) as a throw-away variable, but here I’m actually using it. Comes from my involvement with Scala lately. Getting back to the reduction on typing, it's not about a few characters, it's about the possibility to modify the code more easily later on. If the size had to be changed to 20 instead of 10, you’d need to change code in 2 places. For a small example like this, it might not be useful, but it’s a good habit to get into. – Oliver W. Dec 08 '19 at 02:53
@OliverW. Building good habits is certainly a strong reason. I think the variable name is also bugging me. Something like `rand_range_func` or `rand_gen_func` would make things immediately clear. Changing the size shouldn't be an issue, since the function takes it as a parameter, right? ;) :p – AMC Dec 08 '19 at 02:57
@AlexanderCécile care to join me in [chat](https://chat.stackoverflow.com/rooms/203828/q59231193) for an extended discussion? – Oliver W. Dec 08 '19 at 03:14

AMC · Answer 2 · 2019-12-08T04:24:01.253

This is a tweaked version of the solution by Oliver W., who deserves full credit for the answer.

import numpy as np
import pandas as pd

min_max = [(1, 31), (4, 172.583333), (0, 88.50561), (4, 297), (3, 37)]


def get_rand_range(low, high, size):
    if isinstance(low, float) or isinstance(high, float):
        return np.random.uniform(low, high, size)
    else:
        return np.random.randint(low, high, size)


cols_dict = dict(zip('ABCDE', (get_rand_range(low, high, 10) for low, high in min_max)))
df_1 = pd.DataFrame(data=cols_dict)

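Bear in mind that both functions exclude the upper bound: np.random.uniform draws from the interval [low, high), and np.random.randint likewise returns integers from [low, high). To make high a possible value in an integer column, pass high + 1.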
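
The bounds of both functions can be checked empirically. The snippet below is a minimal sketch, not part of the original answer: it uses a small range so that every value shows up, and it assumes NumPy 1.17 or later for the `Generator.integers` call with `endpoint=True`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Legacy API: the upper bound 3 is never returned.
legacy = np.random.randint(1, 3, size=1000)
assert set(np.unique(legacy).tolist()) <= {1, 2}

# Adding 1 to the upper bound makes 3 a possible value.
legacy_inclusive = np.random.randint(1, 3 + 1, size=1000)
assert set(np.unique(legacy_inclusive).tolist()) <= {1, 2, 3}

# Generator API: endpoint=True makes the upper bound inclusive.
inclusive = rng.integers(1, 3, size=1000, endpoint=True)
assert set(np.unique(inclusive).tolist()) <= {1, 2, 3}

# uniform samples floats from the half-open interval [low, high).
floats = np.random.uniform(0.0, 1.0, size=1000)
assert floats.min() >= 0.0 and floats.max() < 1.0
```

For the ranges in the question, this means a column built with `np.random.randint(1, 31, 10)` holds values from 1 to 30.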

thank you for sharing the knowledge, help and contribution to the question. — Mark K, Dec 09 '19 at 02:40
