I hope that everybody is staying safe amid the COVID-19 pandemic. I am new to Python and have a quick question about importing data from a CSV into Python for conducting a simple logistic regression analysis where the dependent variable is binary, and the independent variable is continuous.

I imported a CSV file and want to use one variable (Active) as the independent variable and another (Smoke) as the response variable. I can load the CSV file into Python, but each time I try to fit a logistic regression model to predict Smoke from Active, I get an error saying that Active must be reshaped into a single column (two-dimensional), because it is currently one-dimensional.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
data = pd.read_csv('Pulse.csv') # Read the data from the CSV file
x = data['Active'] # Load the values from Active into the independent variable
x = np.array.reshape(-1,1)
y = data['Smoke'] # The dependent variable is set as Smoke

I keep receiving the following error message:

ValueError: Expected 2D array, got 1D array instead: array=[ 97. 82. 88. 106. 78. 109. 66. 68. 100. 70. 98. 140. 105. 84. 134. 117. 100. 108. 76. 86. 110. 65. 85. 80. 87. 133. 125. 61. 117. 90. 110. 68. 102. 67. 112. 86. 85. 66. 73. 85. 110. 97. 93. 86. 80. 96. 74. 124. 78. 93. 80. 80. 92. 69. 82. 88. 74. 74. 75. 120. 105. 104. 99. 113. 67. 125. 133. 98. 80. 91. 76. 78. 94. 150. 92. 96. 68. 82. 102. 69. 65. 84. 86. 84. 116. 88. 65. 101. 89. 128. 68. 90. 80. 80. 98. 90. 82. 97. 90. 98. 88. 94. 92. 96. 80. 66. 110. 87. 88. 94. 96. 89. 74. 111. 81. 98. 99. 65. 95. 127. 76. 102. 88. 125. 72. 76. 112. 69. 101. 72. 112. 81. 90. 96. 66. 114. 71. 75. 102. 138. 85. 80. 107. 119. 98. 95. 95. 76. 96. 102. 82. 99. 80. 83. 102. 102. 106. 79. 80. 79. 110. 144. 80. 97. 60. 80. 108. 107. 51. 68. 80. 80. 60. 64. 87. 110. 110. 82. 154. 139. 86. 95. 112. 120. 79. 64. 84. 65. 60. 79. 79. 70. 75. 107. 78. 74. 80. 121. 120. 96. 75. 106. 88. 91. 98. 63. 95. 85. 83. 92. 81. 89. 103. 110. 78. 122. 122. 71. 65. 92. 93. 88. 90. 56. 95. 83. 97. 105. 82. 102. 87. 81.]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Below is the entire updated code, which still produces errors (04/12/2020). I was unable to paste the error log into this post, so I have copied it into this public Google Document: https://docs.google.com/document/d/1vtrj6Znv54FJ4Zvv211TQvvCN6Ac5LDaOfvHicQn0nU/edit?usp=sharing

Also, here is the CSV file: https://drive.google.com/file/d/1g_-vPNklxRn_3nlNPsR-IOflLfXSzFb1/view?usp=sharing

scikit-learn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
data = pd.read_csv('Pulse.csv')
x = data['Active']
y = data['Smoke']
lr = LogisticRegression().fit(x.values.reshape(-1,1), y)
p_pred = lr.predict_proba(x.values)
y_pred = lr.predict(x.values)
score_ = lr.score(x.values,y.values)
conf_m = confusion_matrix(y.values,y_pred.values)
report = classification_report(y.values,y_pred.values)
confusion_matrix(y, lr.predict(x))    
cm = confusion_matrix(y, lr.predict(x))
fig, ax = plt.subplots(figsize = (8,8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0,1), ticklabels = ('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0,1), ticklabels = ('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j,i,cm[i,j],ha='center',va='center',color='red', size='45')
plt.show()
print(classification_report(y,model.predict(x)))
cielo_azzuro
  • Try without this line `x = np.array.reshape(-1,1)` – ManojK Apr 11 '20 at 15:40
  • Thank you for the suggestion. I tried it but the result was the same: "ValueError: Expected 2D array, got 1D array instead." – cielo_azzuro Apr 11 '20 at 15:52
  • Can you add complete code which also includes the model fitting part? – ManojK Apr 11 '20 at 15:54
  • Dear ManojK, thank you for your patience and continued support. I have updated this question with the entire code at my disposal, and I also copied and pasted the error log (which was not accepted here when I tried to submit it) into a Google Document. Any suggestion would be much appreciated. – cielo_azzuro Apr 12 '20 at 22:54
  • Please check my answer below. – ManojK Apr 13 '20 at 06:52

2 Answers

Try this:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

data = pd.read_csv('Pulse.csv') # Read the data from the CSV file
x = data['Active'] # Load the values from Active into the independent variable
y = data['Smoke'] # The dependent variable is set as Smoke

lr = LogisticRegression().fit(x.values.reshape(-1,1), y)
  • Thank you. After entering the provided command, I received the following error: `p_pred = lr.predict_proba(x) y_pred = lr.predict(x) score_ = lr.score(x,y) conf_m = confusion_matrix(y,y_pred) report = classification_report(y,y_pred)` – cielo_azzuro Apr 11 '20 at 17:40
  • That isn't an error; it is code. Note that you must pass `x.values.reshape(-1,1)` instead of `x`. – Cristian Contrera Apr 11 '20 at 18:13
  • Dear Cristian, many thanks for the continued support. I tried your suggestion but was unable to circumvent the error. I have updated this question with the entire code, as well as the error log that is generated after trying to run it. Any suggestions would be appreciated, thank you. – cielo_azzuro Apr 12 '20 at 22:55
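The point made in the comments above can be sketched as follows: every scikit-learn call that takes the feature matrix expects the same 2D shape, so it may be simplest to reshape once and reuse the result. This is only a minimal sketch on synthetic data, not the actual Pulse.csv file; the generated values and the names `df` and `X` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for Pulse.csv (values are made up for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Active": rng.normal(90, 18, size=200).round(),
    "Smoke": rng.integers(0, 2, size=200),
})

# Reshape the single feature to 2D once: shape (n_samples, 1).
X = df["Active"].values.reshape(-1, 1)
y = df["Smoke"].values

lr = LogisticRegression().fit(X, y)

# Reuse the 2D array in every later call; passing the 1D df["Active"]
# here would raise a ValueError about the input not being 2D.
p_pred = lr.predict_proba(X)   # shape (n_samples, 2)
y_pred = lr.predict(X)         # NumPy array, so it has no .values attribute
score_ = lr.score(X, y)
conf_m = confusion_matrix(y, y_pred)
```

Because `predict` returns a NumPy array rather than a pandas object, it is passed to `confusion_matrix` directly, without `.values`.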

The code below should work:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
data = pd.read_csv('Pulse.csv')
x = pd.DataFrame(data['Active'])  # one-column DataFrame, so x is 2D
y = data['Smoke']
lr = LogisticRegression()
lr.fit(x,y)
p_pred = lr.predict_proba(x)
y_pred = lr.predict(x)
score_ = lr.score(x,y)
conf_m = confusion_matrix(y,y_pred)
report = classification_report(y,y_pred)

print(score_)
0.8836206896551724

print(conf_m)
[[204   2]
 [ 25   1]]
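As a quick sanity check, the printed accuracy can be recovered from the confusion matrix, since correct predictions sit on its diagonal. A small sketch using the matrix printed above (the variable name `cm` is only for illustration):

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are actual classes, columns are predicted classes.
cm = np.array([[204, 2],
               [25, 1]])

# Accuracy = correct predictions / all predictions = (204 + 1) / 232
accuracy = np.trace(cm) / cm.sum()
print(accuracy)

# Number of rows the model labels as class 1 (second column).
predicted_positive = cm[:, 1].sum()
print(predicted_positive)
```

The second figure shows that only 3 of the 232 rows are predicted as class 1, so the accuracy here largely reflects how common class 0 is in the data.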
ManojK
  • Dear ManojK, many thanks for your patience and continued support... Unfortunately I was unable to get it to work. Here is a public link to the PDF: https://drive.google.com/file/d/1_1FUHuLWh2KsxjbTXAdlx4lZHxPcrHEc/view?usp=sharing and here is the CSV file: http://www.stat2.org/datasets/Pulse.csv Sincerely, ciel_azzuro – cielo_azzuro Apr 13 '20 at 21:53
  • See my updated code; it is working fine now. I changed only this line: `x = pd.DataFrame(data['Active'])`. It was giving errors because `x` was a `Series`; now it is converted to a `DataFrame`. – ManojK Apr 14 '20 at 07:33
  • I am immensely grateful for your valuable time and insight. The output of the analyses matched with that of another computational software (SPSS). Thank you. – cielo_azzuro Apr 14 '20 at 20:37
  • I have just clicked the green check. If there is any further action, please let me know. – cielo_azzuro Apr 15 '20 at 14:17
  • Dear ManojK, I just wanted to let you know that I have been referencing this page for your suggestion for future logistic regression models and your suggestion continues to be immensely useful. Thanks again. – cielo_azzuro Apr 30 '20 at 00:15
  • Great, please let me know if you have any more questions. – ManojK Apr 30 '20 at 07:23
  • Would you by chance have any experience with simple logistic regression in Statsmodels? https://stackoverflow.com/questions/61560569/simple-logistic-regression-with-statsmodels-adding-an-intercept-and-visualizing – cielo_azzuro May 05 '20 at 21:38
  • I have answered on the question. – ManojK May 06 '20 at 08:42
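The Series-to-DataFrame change discussed in the comments comes down to array shape: selecting one column with single brackets gives a 1D `Series`, while wrapping it in `pd.DataFrame` (or selecting with double brackets) gives a 2D, one-column table, which is the shape scikit-learn estimators expect for features. A minimal sketch with a tiny made-up frame; the values are assumptions and do not come from Pulse.csv:

```python
import pandas as pd

# Tiny made-up frame standing in for the CSV data.
data = pd.DataFrame({"Active": [97, 82, 88, 106], "Smoke": [0, 1, 0, 0]})

series_x = data["Active"]               # Series, 1D
frame_x = pd.DataFrame(data["Active"])  # DataFrame, 2D (the conversion described above)
bracket_x = data[["Active"]]            # equivalent double-bracket selection

print(series_x.shape)   # (4,)
print(frame_x.shape)    # (4, 1)
print(bracket_x.shape)  # (4, 1)
```

Either 2D form can be passed to `fit`, `predict`, `predict_proba`, and `score` without any further reshaping.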