Speed differences and errors when creating a pandas data frame

Question

I have been trying to convert data obtained from a Google sheet into a pandas dataframe.

My solution works:

header = data[0] # The first row
all_other_rows = data[1:] # All other rows
df = pd.DataFrame(header, columns=all_other_rows)

However, I don't understand why this code failed:

df = pd.DataFrame(data[0], columns=data[1:])

The error initially is "Shape of Values passed is (4, 1), indices imply (4, 4)", but even when resolved according to this answer, by adding brackets around data[0], it takes 5-10 minutes and the df loads incorrectly. What is the difference?

Extra details of this code:

The data was imported with this code:

  gc = gspread.authorize(credentials)
  wb = gc.open_by_key(spreadsheet_key)
  ws = wb.worksheet(worksheet_name)
  ws.get_all_values()

Here's sample data:

[['id', 'other_id', 'another_id', 'time_of_determination'],
 ['63409', '1', '', '2019-11-14 22:01:19.386903+00'],
 ['63499', '1', '8', '2019-11-14 22:01:19.386903+00'],
 ['63999', '1', '', '2019-11-14 22:01:19.386903+00'],
 ['69999', '1', '', '2019-11-14 22:01:19.386903+00']]

Could you please provide some sample data and a minimal working code. Otherwise, it would be too difficult to understand what went wrong. — Mohammad, Jan 14 '22 at 19:06
Because the data was imported from Google Sheets to Jupyter Notebook, it's hard to provide all working code. But if you set data = the data provided, you could start there. — Kat Cow, Jan 14 '22 at 19:16
@KatCow - To help you get better answers, make sure to provide a complete working code that people can easily replicate. Attaching different section of your code is not ideal as it is difficult to replicate and understand. If you review the code you've provided, some variable declarations are missing. See [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). Also, since you are using gspread, make sure to provide sample spreadsheet. — Nikko J., Jan 14 '22 at 20:27

Md Abrar Jahin · Answer 1 · 2022-01-14T19:24:10.610

You can download Google Sheet in MS Excel format. Then try:

import pandas as pd
df = pd.read_excel(excel_file)

You don't have to mention columns explicitly for this. read_excel will automatically detect columns.

Alternatively, I guess you wanted something like this. The issue is probably with your rows and columns selection:

data = [['id', 'other_id', 'another_id', 'time_of_determination'],
 ['63409', '1', '', '2019-11-14 22:01:19.386903+00'],
 ['63499', '1', '8', '2019-11-14 22:01:19.386903+00'],
 ['63999', '1', '', '2019-11-14 22:01:19.386903+00'],
 ['69999', '1', '', '2019-11-14 22:01:19.386903+00']]

import pandas as pd
header = data[0] # The first row
all_other_rows = data[1:] # All other rows
pd.DataFrame(all_other_rows, columns=header)

Output:

index    id   other_id  another_id  time_of_determination
0     63409          1              2019-11-14 22:01:19.386903+00
1     63499          1           8  2019-11-14 22:01:19.386903+00
2     63999          1              2019-11-14 22:01:19.386903+00
3     69999          1              2019-11-14 22:01:19.386903+00

Speed differences and errors when creating a pandas data frame

1 Answers1