
I'm learning NLTK and I need to load a large file, but I don't want to save it on my desktop. How can I read in a file with Python that's hosted on a website?

I tried this code, but it didn't work. I assume the `with open` is the reason, but I need to use `with open` because I need to have the data as a file, `myfile` in this case.

import nltk

with open('http://www.sls.hawaii.edu/bley-vroman/brown.txt', 'r')as myfile:
    data=myfile.read().replace('\n', 'r')

data2 = data.replace("/", "")

for i, line in enumerate(data2.split('\n')):
    if i>10:
        break
    print(str(i) + ':\t' + line)

and this is the error:

Traceback (most recent call last):
  File "tut1.py", line 3, in <module>
    with open('http://www.sls.hawaii.edu/bley-vroman/brown.txt', 'r')as myfile:
FileNotFoundError: [Errno 2] No such file or directory: 'http://www.sls.hawaii.edu/bley-vroman/brown.txt'

What can I do to use the file in my script without downloading the whole file?

I changed the code to work with `requests`:

import nltk
import requests

myfile = requests.get('http://www.sls.hawaii.edu/bley-vroman/brown.txt')

data=myfile.read().replace('\n', 'r')

but now when I run this I get this error:

Traceback (most recent call last):
  File "tut1.py", line 6, in <module>
    data=myfile.read().replace('\n', 'r')
AttributeError: 'Response' object has no attribute 'read'
  • In Python, `open()` works quite differently from `file_get_contents()` in PHP. To perform an HTTP request you can use either the built-in [`urllib`](https://docs.python.org/3/library/urllib.html) or [`requests`](https://requests.readthedocs.io/en/master/) (*or dozens of other third-party libs*) – Olvin Roght Nov 21 '20 at 19:44
  • 1
    Do not share code in comments, [edit](https://stackoverflow.com/posts/64947319/edit) your question and add code there. – Olvin Roght Nov 21 '20 at 19:50
  • No, it doesn't, because I don't have an image, I have text – yappy twan Nov 21 '20 at 20:01
  • Hi, just use [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) – grumpyp Nov 21 '20 at 20:09
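As the comment above notes, the standard-library `urllib` can fetch the file with no third-party dependency. A minimal sketch, assuming the URL from the question is still reachable:

```python
from urllib.request import urlopen

url = 'http://www.sls.hawaii.edu/bley-vroman/brown.txt'
with urlopen(url) as response:
    # read() returns bytes; decode before using string methods
    data = response.read().decode('utf-8', errors='replace')

# print the first lines, as the original script intended
for i, line in enumerate(data.split('\n')):
    if i > 10:
        break
    print(str(i) + ':\t' + line)
```

Note this still downloads the whole body into memory; the streaming approaches in the answers below avoid that.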

3 Answers


If you want to process the first N (here 10) lines of the file, never reading the whole response into memory, here's how to do that:

import nltk
import requests

myfile = requests.get('http://www.sls.hawaii.edu/bley-vroman/brown.txt', stream=True).raw

for i in range(0, 10):
    line = myfile.readline()
    data = line.decode().replace('\\n', 'r')
    print(data, end="")

Result:

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place. The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, "deserves the praise and thanks of the City of Atlanta" for the manner in which the election was conducted.

The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible "irregularities" in the hard-fought primary which was won by

The three problems I fixed are:

  1. requests.get() doesn't return a file-like object. Add .raw to get that, and add stream=True to the request as well to get it to act right.
  2. You're calling read(), which will work once you address #1, but will read in the whole file. That's not what you want. I assume you want to read line by line by calling readline().
  3. You have to decode the incoming bytes to text before you can operate on them with string methods. That's what the decode() does.

Of course, to process 10 lines instead of 1, you need a loop and a way to do just 10 lines. I added that as well. I also added a print() call so we could all see the results.

I assume that the replace() in my code isn't really quite what you want. I'm guessing that you meant replace('\\n', '\\r'), but since I wasn't sure (I don't know what that buys you), I left that to you to deal with. I did fix it so that it didn't completely wipe out the line (not sure why it does that) by adding a second backslash to the search term.

CryptoFool

You can access the content of that .txt file with no errors like this:

import requests

myfile = requests.get('http://www.sls.hawaii.edu/bley-vroman/brown.txt')

data = myfile.text
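Once `data` holds the text, the rest of the question's processing works on a plain string. A sketch of that follow-up, with a small inline sample standing in for the downloaded text:

```python
# stand-in for myfile.text; the real script would use the downloaded string
data = ("The Fulton County Grand Jury said Friday an investigation\n"
        "of Atlanta's recent primary election produced no evidence\n"
        "that any irregularities took place.\n")

data2 = data.replace("/", "")  # remove slashes, as in the question

# print the first lines with their index, matching the original loop
for i, line in enumerate(data2.split('\n')):
    if i > 10:
        break
    print(str(i) + ':\t' + line)
```

Keep in mind `.text` reads the whole response into memory, so this doesn't address the "without downloading the whole file" part of the question.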
Patrik

There's iter_lines() which allows you to consume streaming content line by line:

resp = requests.get('http://www.sls.hawaii.edu/bley-vroman/brown.txt', stream=True)
for i, l in enumerate(resp.iter_lines()):
    if i < 10:
        print(l)  # use l.decode() to get string
    else:
        break
resp.close()  # release the connection instead of leaving it hanging

Or even simpler:

for _, l in zip(range(10), resp.iter_lines()):
    print(l)  # use l.decode() to get string

Or the best:

from itertools import islice

print(*islice(resp.iter_lines(), 10), sep="\n")
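`islice` stops consuming the iterator after 10 items no matter how long the stream is, which is why it pairs well with `iter_lines()`. A small offline sketch of the same pattern:

```python
from itertools import islice

# any iterator behaves the same way resp.iter_lines() would here
lines = (f"line {i}" for i in range(1000))

first_ten = list(islice(lines, 10))  # takes 10 items, leaves the rest unread
print(*first_ten, sep="\n")
```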
Olvin Roght