I have data measured at a few given temperatures [30, 40, 45, ...].

Is it possible to generate synthetic data for other temperatures using scikit-learn or any other library?

I am using the existing data and the following Python code to plot the mean at each temperature.

#!/usr/bin/env python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

data = pd.read_csv("trialdata.csv")
#  data[(np.abs(stats.zscore(data)) < 3).any(axis=1)]
#  print(data)
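# Average every column per temperature; reset_index() turns "Temp" back into
# a regular column, so the values do not need to be hard-coded.
data = data.groupby("Temp").mean().reset_index()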
print(data)
data.plot.line(y="Er", x="Temp", use_index=True, style="o-")
plt.ylabel("Er")
plt.tight_layout()
plt.show()

I want to generate data for other temperatures, e.g. [35, 65, 70], to use as a machine learning training set.

BaRud
  • Is it cross-sectional or time-series data? If it is cross-sectional, try fitting a parametric distribution to the data and sampling from it, e.g. [fitting a gamma distribution](https://stackoverflow.com/questions/2896179/fitting-a-gamma-distribution-with-python-scipy). If it is time-series, try [AutoARIMA](https://github.com/Nixtla/statsforecast). – Wakeme UpNow Feb 13 '23 at 02:34

1 Answer

In the simplest case, where you want uniformly distributed synthetic temperature values with no constraints:

temp_pool = np.arange(30, 55, 5) # Example possible temperatures: [30, 35, 40, 45, 50]
df_synthetic = pd.DataFrame({'Temp': np.random.choice(temp_pool, size=100)})

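numpy.random.choice draws a random sample from a given population, with every value equally likely by default. If you want a skewed sample, pass the parameter p, which gives one probability per value in the pool (the probabilities must sum to 1):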

df_synthetic = pd.DataFrame({'Temp': np.random.choice(temp_pool,
                                                      size=100,
                                                      p=[0.5, 0.2, 0.15, 0.1, 0.05])})

This way, the temperature values will approximately follow these frequencies:

  • 30: 0.5 or 50%
  • 35: 0.2 or 20%
  • 40: 0.15 or 15%
  • 45: 0.1 or 10%
  • 50: 0.05 or 5%
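As a quick sanity check, the observed shares can be compared against p. This is only a sketch: the seed, the larger sample size, and the names rng, weights and observed are assumptions added for illustration, and it uses NumPy's Generator.choice rather than np.random.choice so the result is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # assumed seed, only for repeatability
temp_pool = np.arange(30, 55, 5)          # [30, 35, 40, 45, 50]
weights = [0.5, 0.2, 0.15, 0.1, 0.05]     # one probability per temperature, sums to 1

# A larger sample than in the answer, so the shares settle close to the weights
df_synthetic = pd.DataFrame(
    {"Temp": rng.choice(temp_pool, size=10_000, p=weights)}
)

# Fraction of rows at each temperature, ordered by temperature
observed = df_synthetic["Temp"].value_counts(normalize=True).sort_index()
print(observed)
```

With only 100 rows, as in the snippet above, the observed shares can drift noticeably from the requested ones.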

Finally, if you want your data to follow a specific distribution, you can sample from any of the distributions in scipy.stats.
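For instance, here is a minimal sketch that draws temperatures from a truncated normal distribution. The range (30 to 70), the centre (50), the spread (10) and the 5-degree grid are all assumptions chosen for illustration, not values taken from the question. Like the snippets above, it only generates temperature values, not the corresponding Er.

```python
import numpy as np
import pandas as pd
from scipy import stats

low, high = 30.0, 70.0    # assumed temperature range
mean, std = 50.0, 10.0    # assumed centre and spread

# truncnorm expects its bounds in standard deviations relative to loc
a, b = (low - mean) / std, (high - mean) / std
dist = stats.truncnorm(a, b, loc=mean, scale=std)

# Draw continuous temperatures, then snap them to a 5-degree grid
temps = dist.rvs(size=100, random_state=0)
df_synthetic = pd.DataFrame({"Temp": (np.round(temps / 5) * 5).astype(int)})
print(df_synthetic["Temp"].value_counts().sort_index())
```

Any other scipy.stats distribution can be swapped in the same way through its rvs method.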

lezaf