
I have an Excel file stored on GitHub and Python installed on an AWS machine. I want to read the Excel file from the AWS machine with a Python script. Can someone help me achieve this? So far I have tried the code below:

#Importing required Libraries
import pandas as pd
import xlwt
import xlrd

#Formatting WLM data
URL= 'https://github.dev.global.tesco.org/DotcomPerformanceTeam/Sample-WLM/blob/master/LEGO_LIVE_FreshOrderStableProfile_2019_v0.1.xlsx'
data = pd.read_excel(r"URl", sheet_name='WLM', dtype=object)

When I execute this, I get the following error:

IOError: [Errno 2] No such file or directory: 'URl'
SG131712
  • typo: `data = pd.read_excel(URL, sheet_name='WLM', dtype=object)`? – Jan Garaj Feb 07 '19 at 11:26
  • I guess the problem is that you need to authenticate to get the excel file. This may help: https://stackoverflow.com/questions/33039327/handling-http-authentication-when-accesing-remote-urls-via-pandas – Rob Feb 07 '19 at 12:28
  • Hi Jan Garaj, I don't think it is because of a typo. I also tried it without the quotes, but that gave the error below: `data = pd.read_excel(URl, sheet_name='WLM', dtype=object) NameError: name 'URl' is not defined` – SG131712 Feb 07 '19 at 15:59

2 Answers


You can use the wget command to download the file from GitHub. The key here is to use the raw version of the link, otherwise you will download an HTML page instead of the file. To get the raw link, click on the file you uploaded to GitHub, then right-click the Raw button and either save the file or copy the link address. You can then use that link to download the file and read it with pd.read_excel("Your Excel file URL or disk location"). Example:

#Raw link: https://raw.github.com/<username>/<repo>/<branch>/Excelfile.xlsx

import pandas as pd

!wget --show-progress --continue -O /content/Excelfile.xlsx https://raw.github.com/<username>/<repo>/<branch>/Excelfile.xlsx

#Read from the same absolute path that wget wrote to
df = pd.read_excel("/content/Excelfile.xlsx")

Note: this example applies to Colab. If you are using a local environment, drop the exclamation mark and run the wget command in a shell. You can also find more ideas here: Download single files from GitHub
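
The step of turning a normal file link into a raw link can also be done in Python. This is only a minimal sketch: it assumes the file is on public github.com with the usual https://github.com/<username>/<repo>/blob/<branch>/<path> layout, and the helper name blob_to_raw_url is hypothetical. A GitHub Enterprise host may lay out its raw URLs differently.

```python
from urllib.parse import urlsplit


def blob_to_raw_url(blob_url: str) -> str:
    """Convert a github.com 'blob' page URL into its raw-content URL.

    Assumes the layout https://github.com/<user>/<repo>/blob/<branch>/<path>.
    """
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")
    # Need at least: user, repo, "blob", branch, file name.
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a github.com blob URL: {blob_url}")
    # Drop the "blob" segment and keep branch + path.
    user, repo, _, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, *rest])
```

The returned string can be passed straight to pd.read_excel(...), since pandas can read from an http(s) URL, so no separate download step is needed for a public file.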

Javiers

These instructions are for a CSV file, but they should work for an Excel file as well (with pd.read_excel in place of pd.read_csv).

If the repository is private, you might need to create a personal access token as described in "Creating a personal access token" (pay attention to the permissions especially if the repository belongs to an organisation).

  1. Click the "Raw" button in GitHub, for example on the page https://github.com/udacity/machine-learning/blob/master/projects/boston_housing/housing.csv.

If the repo is private and there is no ?token=XXXX at the end of the URL (see below), you might need to create a personal access token and add it at the end of the URL. I can see from your URL that you need to configure your access token to work with SAML SSO; please read "About identity and access management with SAML single sign-on" and "Authorizing a personal access token for use with SAML single sign-on".

  2. Copy the link to the file from the browser address bar, e.g.:
https://raw.githubusercontent.com/udacity/machine-learning/master/projects/boston_housing/housing.csv
  3. Then use the code:
import pandas as pd

url = (
    "https://raw.githubusercontent.com/udacity/machine-learning/master"
    "/projects/boston_housing/housing.csv"
)

df = pd.read_csv(url)

In case your repo is private, the link copied would have a token at the end:

https://raw.githubusercontent.com/. . ./my_file.csv?token=XXXXXXXXXXXXXXXXXX
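
Instead of putting a token in the URL, you could try sending a personal access token in a request header. This is only a minimal sketch: it assumes your GitHub host accepts an "Authorization: token ..." header for raw file URLs, which you should verify for your setup. The function names build_github_request and read_private_csv are hypothetical.

```python
import io
import urllib.request


def build_github_request(raw_url: str, token: str) -> urllib.request.Request:
    """Build a GET request that carries the token in an Authorization header.

    Assumption: the host accepts a personal access token sent this way.
    """
    return urllib.request.Request(
        raw_url, headers={"Authorization": f"token {token}"}
    )


def read_private_csv(raw_url: str, token: str):
    """Download the file with the token and parse it with pandas."""
    import pandas as pd  # imported here so the helper above works without pandas

    with urllib.request.urlopen(build_github_request(raw_url, token)) as response:
        # Wrap the downloaded bytes in a file-like object for pandas.
        return pd.read_csv(io.BytesIO(response.read()))
```

For an Excel file, the same download step should work with pd.read_excel in place of pd.read_csv. It is usually safer to read the token from an environment variable than to hard-code it in the script.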
Serg