0

I have been trying to obtain the Statewise sheet from this public googlesheet link as a python dataframe.

The URL of this sheet is different from the URL of other examples for achieving the goal of getting sheet as dataframe, seen on this website.

URL is this : https://docs.google.com/spreadsheets/d/e/2PACX-1vSc_2y5N0I67wDU38DjDh35IZSIS30rQf7_NYZhtYYGU1jJYT6_kDx4YpF-qw0LSlGsBYP8pqM_a1Pd/pubhtml#

One standard way maybe the following,

import pandas

googleSheetId = '<Google Sheets Id>'
worksheetName = '<Sheet Name>'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(
    googleSheetId,
    worksheetName
)

df = pandas.read_csv(URL)
print(df)

But in the present URL I do not see a structure used here. Can someone help to clarify. Thanks.

ankit7540
  • 315
  • 5
  • 20
  • 1
    check this package : [gspread-pandas](https://pypi.org/project/gspread-pandas/) it might be helpful in connecting with google spreadsheets – sammywemmy Mar 25 '20 at 12:56
  • @sammywemmy : It apppears that gspread is for private google sheets. Thus requiring authorization. – ankit7540 Mar 25 '20 at 12:58
  • my bad. check out J-H 's solution then. Also check out this [solution](https://stackoverflow.com/a/19611857/7175713) – sammywemmy Mar 25 '20 at 12:59

2 Answers2

2

The Google spreadsheet is actually an html thing. So you should use read_html to load it into a list of pandas dataframes:

dfs = pd.read_html(url, encoding='utf8')

if lxml is available or, if you use BeautifulSoup4:

dfs = pd.read_html(url, flavor='bs4', encoding='utf8')

You will get a list of dataframes and for example dfs[0] is:

     0   1                                                  2                3
0    1  id                                             Banner  Number_Of_Times
1    2   1  Don't Hoard groceries and essentials. Please e...                2
2    3   2  Be compassionate! Help those in need like the ...                2
3    4   3  Be considerate : While buying essentials remem...                2
4    5   4  Going out to buy essentials? Social Distancing...                2
5    6   5  Plan ahead! Take a minute and check how much y...                2
6    7   6  Plan and calculate your essential needs for th...                2
7    8   7  Help out the elderly by bringing them their gr...                2
8    9   8  Help out your workers and domestic help by not...                2
9   10   9  Lockdown means LOCKDOWN! Avoid going out unles...                1
10  11  10           Panic mode : OFF! ❌ESSENTIALS ARE ON! ✔️                1
11  12  11  Do not panic! ❌ Your essential needs will be t...                1
12  13  12  Be a true Indian. Show compassion. Be consider...                1
13  14  13  If you have symptoms and suspect you have coro...                1
14  15  14  Stand Against FAKE News and WhatsApp Forwards!...                1
15  16  15  If you have any queries, Reach out to your dis...                1
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
0

You can use the following snippet:

from io import BytesIO
import requests

r = requests.get(URL)
data = r.content

df = pd.read_csv(BytesIO(data), index_col=0, error_bad_lines=False)
J-H
  • 1,795
  • 6
  • 22
  • 41
  • Already tried this. I get the error : `Error tokenizing data. C error: Expected 6 fields in line 48, saw 20` Here we do not specify the sheet hence I do not know which sheet is being obtained. – ankit7540 Mar 25 '20 at 13:00
  • You can try adding error_bad_lines=False parameter, but maybe something with your xsl file is not okay – J-H Mar 25 '20 at 13:02
  • I tried that as well. I am not accessing the correct worksheet hence some other data comes in. – ankit7540 Mar 25 '20 at 13:04