Reading a url gz file into python using pandas

Question

I am trying to read this URL into Python: http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names. It contains a dataset, but my code keeps throwing a file-not-found error. Any possible solutions?

I tried this -

#Code: Importing libraries and reading features list from ‘kddcup.names’ file.

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
  

# reading features list
with open("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names", 'r') as f:
    print(f.read())

Jan_B · Answer 1 · 2023-08-04T09:59:55.850

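You cannot read website content as if it were a file on your computer. The built-in open() only accepts local file paths, which is why passing it a URL fails with the error you see.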
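To make the distinction concrete, here is a minimal sketch. The helper name `read_text` is hypothetical (it is not part of any library): it sends anything that starts with `http://` or `https://` through `urllib.request.urlopen` and everything else through the built-in `open`.

```python
from urllib.request import urlopen


def read_text(source, encoding="utf-8"):
    """Return the text behind `source`, which is a URL or a local path.

    Hypothetical helper, for illustration only.
    """
    if source.startswith(("http://", "https://")):
        # Remote resource: open() cannot handle this, urlopen() can.
        with urlopen(source) as response:
            return response.read().decode(encoding)
    # Local path: the built-in open() is the right tool.
    with open(source, "r", encoding=encoding) as f:
        return f.read()
```

Calling `read_text` with the URL from the question would take the `urlopen` branch, while a path such as a downloaded copy of the file would take the `open` branch.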

If you inspect the page with your browser's developer tools, you will notice that the data you want to load is inside an HTML pre tag in the body of the page. A quick search on this site turns up the following question, which explains how to extract the data from the page: cannot-extract-pre-tag-from-webpage

from urllib.request import urlopen

website_content = urlopen('http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names').read().decode('utf-8')

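# now let's process the content string into a more useful dictionary
data = {}
for part in website_content.split('\n'):
    if ': ' in part:
        # split on the first ': ' only, so a line without that exact
        # separator can no longer raise an IndexError
        key, _, value = part.partition(': ')
        data[key] = value
    elif ',' in part:
        # a comma-separated line is stored as a list under 'header'
        data['header'] = part.split(',')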

Now you have a dictionary with the data from the website.
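Since the title asks about a gz file, the same idea can be extended to compressed data. What follows is only a sketch under assumptions: it assumes feature lines of the form `name: type.` and a single comma-separated line of labels. `parse_names` and `rows_from_gzip` are hypothetical helpers, and the sample strings are illustrative stand-ins rather than the real file contents.

```python
import csv
import gzip
import io


def parse_names(text):
    """Split a .names-style text into (labels, feature_names).

    Hypothetical helper; assumes 'name: type.' lines and one
    comma-separated line of labels.
    """
    labels, features = [], []
    for line in text.splitlines():
        line = line.strip()
        if ": " in line:
            # Keep only the part before the first ': ' as the column name.
            features.append(line.partition(": ")[0])
        elif "," in line:
            # Drop a trailing '.' and split the labels on commas.
            labels = line.rstrip(".").split(",")
    return labels, features


def rows_from_gzip(raw_bytes, columns):
    """Decompress gzip-compressed CSV bytes into a list of dicts."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text), fieldnames=columns))


# Illustrative sample, not the real file contents.
sample_names = "normal,attack.\nduration: continuous.\nprotocol_type: symbolic.\n"
labels, features = parse_names(sample_names)

# Stand-in for the bytes that urlopen(...).read() would return for a .gz URL.
sample_gz = gzip.compress(b"0,tcp,normal.\n5,udp,attack.\n")
rows = rows_from_gzip(sample_gz, features + ["label"])
print(rows[0])
```

With pandas, the same column list can be passed as `names=` to `pd.read_csv`, which accepts a URL and, by default, infers gzip compression from a `.gz` extension.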
