I hope that everybody is staying safe amid the COVID-19 pandemic. I am new to Python and have a quick question about importing data from a CSV into Python for conducting a simple logistic regression analysis where the dependent variable is binary, and the independent variable is continuous.
I imported a CSV file, then wished to use one variable (Active) as the independent variable and another variable (Smoke) as the response variable. I am able to load the CSV file into Python but each time I try to generate a logistic regression model to predict Smoke from Exercise, I get an error that Exercise has to be reshaped into one column (two dimensional), as it is currently one dimensional.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
data = pd.read_csv('Pulse.csv') # Read the data from the CSV file
x = data['Active'] # Load the values from Exercise into the independent variable
x = np.array.reshape(-1,1)
y = data['Smoke'] # The dependent variable is set as Smoke
I keep receiving the following error message:
ValueError: Expected 2D array, got 1D array instead: array=[ 97. 82. 88. 106. 78. 109. 66. 68. 100. 70. 98. 140. 105. 84. 134. 117. 100. 108. 76. 86. 110. 65. 85. 80. 87. 133. 125. 61. 117. 90. 110. 68. 102. 67. 112. 86. 85. 66. 73. 85. 110. 97. 93. 86. 80. 96. 74. 124. 78. 93. 80. 80. 92. 69. 82. 88. 74. 74. 75. 120. 105. 104. 99. 113. 67. 125. 133. 98. 80. 91. 76. 78. 94. 150. 92. 96. 68. 82. 102. 69. 65. 84. 86. 84. 116. 88. 65. 101. 89. 128. 68. 90. 80. 80. 98. 90. 82. 97. 90. 98. 88. 94. 92. 96. 80. 66. 110. 87. 88. 94. 96. 89. 74. 111. 81. 98. 99. 65. 95. 127. 76. 102. 88. 125. 72. 76. 112. 69. 101. 72. 112. 81. 90. 96. 66. 114. 71. 75. 102. 138. 85. 80. 107. 119. 98. 95. 95. 76. 96. 102. 82. 99. 80. 83. 102. 102. 106. 79. 80. 79. 110. 144. 80. 97. 60. 80. 108. 107. 51. 68. 80. 80. 60. 64. 87. 110. 110. 82. 154. 139. 86. 95. 112. 120. 79. 64. 84. 65. 60. 79. 79. 70. 75. 107. 78. 74. 80. 121. 120. 96. 75. 106. 88. 91. 98. 63. 95. 85. 83. 92. 81. 89. 103. 110. 78. 122. 122. 71. 65. 92. 93. 88. 90. 56. 95. 83. 97. 105. 82. 102. 87. 81.]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Below is the entire, updated code with errors (04/12/2020): *I was unable to enter the error log into this document so I have copied and pasted it into this public Google Document: https://docs.google.com/document/d/1vtrj6Znv54FJ4Zvv211TQvvCN6Ac5LDaOfvHicQn0nU/edit?usp=sharing
Also, here is the CSV file: https://drive.google.com/file/d/1g_-vPNklxRn_3nlNPsR-IOflLfXSzFb1/view?usp=sharing
scikit-learn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
data = pd.read_csv('Pulse.csv')
x = data['Active']
y = data['Smoke']
lr = LogisticRegression().fit(x.values.reshape(-1,1), y)
p_pred = lr.predict_proba(x.values)
y_pred = lr.predict(x.values)
score_ = lr.score(x.values,y.values)
conf_m = confusion_matrix(y.values,y_pred.values)
report = classification_report(y.values,y_pred.values)
confusion_matrix(y, lr.predict(x))
cm = confusion_matrix(y, lr.predict(x))
fig, ax = plt.subplots(figsize = (8,8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0,1), ticklabels = ('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0,1), ticklabels = ('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
for j in range(2):
ax.text(j,i,cm[i,j],ha='center',va='center',color='red', size='45')
plt.show()
print(classification_report(y,model.predict(x)))