Read a csv file from bitbucket using Python and convert it to a df

Question

I am trying to read a CSV file from a Bitbucket URL into a pandas DataFrame (df) using Python. For the work I am doing I cannot read it from a local copy; it has to come from Bitbucket every time.

Any ideas on how to do this? Thank you!

Here is my example:

import pandas as pd

url = 'https://bitbucket.EXAMPLE.com/EXAMPLE/EXAMPLE/EXAMPLE/EXAMPLE/raw/wpcProjects.csv?at=refs%2Fheads%2Fmaster'

colnames = ['project_id', 'project_name', 'gourmet_url']

df7 = pd.read_csv(url, names=colnames)

However, the output is not correct: instead of the expected df, I get some bad data.

Please see this thread: https://stackoverflow.com/questions/32400867/pandas-read-csv-from-url — Akshay Gupta, Nov 30 '21 at 20:04
@AkshayGupta Thank you, I tried that but it does not work for bitbucket, not sure why — Rami Shehadah, Nov 30 '21 at 20:09
@RamiShehadah See the edit in my answer. But also, can you edit your question to include an example file that's not working, so that the answer can provide a full working example? Thanks. — Emir, Nov 30 '21 at 20:17

Emir · Accepted Answer · 2021-11-30T20:36:18.320

You have multiple options, but your question is actually 2 separate questions.

1. How to get a file (.csv in this case) from a remote location.
2. How to load a csv into a "df", which is a pandas data frame.

For #2, you simply import pandas and call df = pandas.read_csv(). See the documentation! If the CSV file were in the current directory, you would do pandas.read_csv('myfile.csv').

For #1, the CSV is on a server somewhere. In this case, it happens to be on bitbucket's servers, accessed from their website. You can fetch it and save it locally, then access it, or you can fetch it to a temporary location, read it into pandas, and discard it. You could even read the data from the file into python as a string. However, having a lot of options doesn't mean they are all useful. I am just listing them for completeness.

Looking at the documentation, pandas already has remote fetching built into the read_csv() function: you can pass a URL instead of a local path, as long as it uses a valid URL scheme. Quoting the pandas documentation:

"Valid URL schemes include http, ftp, s3, gs, and file".

If you want to save it locally, you can use pandas once again, using the .to_csv() method of a data frame.
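A minimal sketch of saving a frame locally, assuming DataFrame.to_csv() is the writer you want. The small frame and the output path here are made up for illustration; in practice df would be whatever pd.read_csv(url) returned.

```python
import os
import tempfile

import pandas as pd

# Stand-in frame; in practice this would be the result of pd.read_csv(url).
df = pd.DataFrame({"project_id": [1, 2], "project_name": ["alpha", "beta"]})

# Hypothetical output location, chosen only for this example.
out_path = os.path.join(tempfile.mkdtemp(), "local_copy.csv")

# index=False keeps the row index out of the file.
df.to_csv(out_path, index=False)

# Reading the copy back gives the same columns and values.
round_trip = pd.read_csv(out_path)
```

Without index=False, pandas writes the row index as an extra first column, which then shows up as an unnamed column when the file is read back.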

FOR BITBUCKET SPECIFICALLY: You need to make sure to link to the 'raw' file on bitbucket. The link used to view the file in your web browser is not, by default, the direct link to the raw file; it's a webpage that offers a view into that file. Get the raw file link, then pass that into pandas.

Code example: Assume we want (a random csv file I found on bitbucket): https://bitbucket.org/pedrorijo91/nodejstutorial/src/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv?at=master

What you need is a link to the raw file! Clicking on ... and pressing 'open raw', we get:

https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv

Let's look at this in detail, the link is the same up to the project name: https://bitbucket.org/pedrorijo91/nodejstutorial/

afterwards, the raw file is under raw/

then it's the same commit hash (the same long string of letters and numbers) db4c991864e65c4d72e98a1dc94e33606e3adde9/

Finally, it's the same directory structure:

node_modules/levelmeup/data/horse_js.csv

The first link goes through src/ and ends with ?at=master, a query parameter that the web server parses when it renders the viewer page. The second link, the actual link to the raw file, goes through raw/ and ends with .csv

import pandas as pd
RAW_Bitbucket_URL = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
df = pd.read_csv(RAW_Bitbucket_URL)

The above code is successful for me.
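The src-to-raw rewrite described above can be sketched as a small helper. This is only an illustration: to_raw_url is a hypothetical name, and it assumes the bitbucket.org path layout shown in this answer (workspace, repository, then src/). A self-hosted Bitbucket server may lay out its URLs differently.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(view_url: str) -> str:
    """Turn a bitbucket.org 'src' page URL into the matching 'raw' file URL."""
    parts = urlsplit(view_url)
    # Path looks like /<workspace>/<repo>/src/<commit-or-branch>/<file path>
    segments = parts.path.split("/")
    if len(segments) < 5 or segments[3] != "src":
        raise ValueError(f"not a Bitbucket 'src' URL: {view_url}")
    segments[3] = "raw"
    # Drop the query string (?at=master) and any fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/".join(segments), "", ""))


view_url = (
    "https://bitbucket.org/pedrorijo91/nodejstutorial/src/"
    "db4c991864e65c4d72e98a1dc94e33606e3adde9/"
    "node_modules/levelmeup/data/horse_js.csv?at=master"
)
raw_url = to_raw_url(view_url)
```

The resulting raw_url can then be passed to pd.read_csv() in place of the hand-copied link.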

That is in my answer already. Thank you for editing to make the answer clearer though. I appreciate it. It's possible I posted before typing the full answer. — Emir, Nov 30 '21 at 20:18
I tried reading the url into pandas_read_csv, however the data is not what its supposed to be — Rami Shehadah, Nov 30 '21 at 20:19
@RamiShehadah please provide an example dataset so I can provide a working answer. Please edit your answer with the example dataset/link and your code. You could also post what you expect the result to be vs. what you get, if you really wanted to ask a great question. — Emir, Nov 30 '21 at 20:22
Also, if you read my full response, I offer an alternative for you. Download the dataset locally, and read it in. Now this is not the problem, but the most likely issue you are running into is that you are reading in an html page or some other thing that's not the CSV file. Alternatively, the CSV file is formatted poorly and you need to do some wrangling to get it into a good format. — Emir, Nov 30 '21 at 20:25
thank you @Emir, i updated my question. Also for the work I am doing I can not read it locally , it has to be from bitbucket all the time — Rami Shehadah, Nov 30 '21 at 20:28
@RamiShehadah. Please see my edit and code example. You are not querying the actual file in your example link. To get the actual file link on the website, currently, you need to go to ... next to the filename and press open as raw. However, as I describe, you can infer the raw filename from the non-raw version, by swapping out src/ for raw/ and ending the link with the filename, not passing in any further arguments using the ?parameter=value format used in most web servers. — Emir, Nov 30 '21 at 20:37

score 0 · Answer 2 · answered Nov 30 '21 at 21:26

You may need to download the entire file first: make the request with requests, save the response to disk, and then read that file with pandas.read_csv().

>>> import pandas as pd
>>> import requests
>>> url = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
>>> r = requests.get(url, allow_redirects=True)
>>> open('file.csv', 'wb').write(r.content)
>>> pd.read_csv('file.csv', encoding='utf-8-sig').head()

                   ID                                              Tweet                 Date                 Via
0  374667940827635712             So, yes, a 100% JS App is 100% awesome  08:59:32, 9-3, 2013                 web
1  374656867466637312  "vituperating priests" who rail against JavaSc...  08:15:32, 9-3, 2013                 web
2  374654221292806144    Node/Browserify/CJS folks, is there any benefit  08:05:01, 9-3, 2013  Twitter for iPhone
3  374640446955212800     100% JavaScript applications. You may get some  07:10:17, 9-3, 2013  Twitter for iPhone
4  374613490763169792       A node.js app that will order you a sandwich  05:23:10, 9-3, 2013                 web
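Since the question says the file cannot be kept locally, here is a sketch of a variant that parses the response in memory instead of writing file.csv. It uses the standard library's urllib.request rather than requests, and the helper names (looks_like_html, csv_text_to_df, read_remote_csv) are made up for this example. The HTML check is a rough heuristic for the failure mentioned in the comments, where a web page comes back instead of the CSV. It does not handle authentication, which a private Bitbucket server may require.

```python
import io
import urllib.request

import pandas as pd


def looks_like_html(text: str) -> bool:
    """Rough heuristic: does the body start like a web page rather than a CSV?"""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def csv_text_to_df(text: str, **read_csv_kwargs) -> pd.DataFrame:
    """Parse CSV text held in memory, refusing bodies that look like HTML."""
    if looks_like_html(text):
        raise ValueError("response looks like an HTML page, not a raw CSV file")
    return pd.read_csv(io.StringIO(text), **read_csv_kwargs)


def read_remote_csv(url: str, **read_csv_kwargs) -> pd.DataFrame:
    """Fetch a URL and parse the body as CSV without touching the disk."""
    with urllib.request.urlopen(url) as response:
        # utf-8-sig strips a leading byte-order mark if the file has one.
        text = response.read().decode("utf-8-sig")
    return csv_text_to_df(text, **read_csv_kwargs)
```

Calling read_remote_csv(url) with the raw link from above should give the same frame as the file-based version, and extra keyword arguments such as names=colnames are passed through to pd.read_csv().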
