
I am using Python 3.4. I have searched the web for a solution but haven't found one.

I have a link to a csv file (a data set).

Is there a way to fetch the data from this link without duplicating it in the local directory? (I don't have enough space on the disk.)

I would like to work with the data in RAM (e.g., I am planning to count the number of data rows and then do some data mining and filtering; the specifics are not important yet).

Dmytro Chasovskyi
Alex Ljamin
    Make an HTTP GET request to the URL for the file, then count the number of lines.... Seems straightforward. Where are you stuck, what have you tried? – OneCricketeer Feb 17 '16 at 08:23
  • @cricket_007 I have tried this [solution](http://stackoverflow.com/questions/16108526/count-how-many-lines-are-in-a-csv-python) but it only works for local csv files – Alex Ljamin Feb 17 '16 at 08:38
  • Okay, that works, but you need to get the file content from opening a URL instead of a file. Using the requests library as mentioned in the answer is a quick way to do it – OneCricketeer Feb 17 '16 at 08:48
  • There is a valid solution provided by @rolf_of_saxony. What is the reason for such outrageous downvoting and putting the question on hold? How am I supposed to ask questions if each of them gets downvoted? It would also be fairer if a downvoter explained why they downvoted. According to the [help center](http://stackoverflow.com/help/reopen-questions) I can apply for the question to be reopened, so please proceed that way or specify what's missing. – Alex Ljamin Feb 17 '16 at 10:06
  • I didn't downvote, but I can only guess that you didn't include what you've tried or do research on how to get file content from a website. Please refer to http://stackoverflow.com/help/how-to-ask if you'd like not to be downvoted. Also, the answer has been accepted, so there is no reason to reopen. – OneCricketeer Feb 17 '16 at 13:11
  • @cricket_007 Okay, I see. What about the downvoted answer? It fixes my issue, but I cannot upvote as I don't have enough reputation. – Alex Ljamin Feb 17 '16 at 13:18
  • Upvotes are 10 points, accepting is 15, so don't feel too bad that you can't upvote; if you want the reputation, ask good questions or answer others. The answer was already downvoted when I saw it, but I'm guessing it originally didn't completely answer the question and recommended unnecessary modules such as requests, BeautifulSoup, and the csv module. – OneCricketeer Feb 17 '16 at 13:23

1 Answer


Try the following:

import requests

r = requests.get('http://127.0.0.1/some_path/small.csv')
print(len(r.text.split('\n')) - 1)

Result: 10

for a small.csv file containing:

1lpcfgokakmgnkcojhhkbfbldkacnbeo,6B5108
pjkljhe2ncpnkpknbcohdijeoejaedia,678425
apdfllc5aahabafndbhieahigkjlhalf,651374
aohghmighlieiainnegkcijnfilokake,591116
coobgpohoikkiipiblmjeljniedjpjpf,587200
dmgjnkhnkblpmfjpdakehnaikgdjllic,540979
felcaaldnbdncclmgdcncolpebgiejap,480535
aapocclcgogkmnckokdopfmhonfmgoek,480441
pdehmppfilefbolgganhfihpbmjlgebh,273609
nafaimnnclfjfedmmabolbppcngeolgf,105979
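
The -1 in the count compensates for the empty string that split('\n') produces after a trailing newline. This can be checked against an in-memory string (hypothetical sample data, standing in for the downloaded body):

```python
# Simulated response body (hypothetical data), ending in a newline
text = "row1,1\nrow2,2\nrow3,3\n"

# split('\n') yields ['row1,1', 'row2,2', 'row3,3', ''] - note the
# trailing empty string, which the -1 removes from the count
print(len(text.split('\n')) - 1)  # 3
```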

Edit (as suggested by mhawke):

import requests

line_cnt = 0
r = requests.get('http://127.0.0.1/some_path/small.csv', stream=True)
for line in r.iter_lines():
    if line.strip():
        line_cnt += 1
print(line_cnt)

This version does not count blank lines and should be more efficient for a large file, because with stream=True, iter_lines reads the response incrementally rather than loading it into memory all at once:

iter_lines(chunk_size=512, decode_unicode=None, delimiter=None)

Iterates over the response data, one line at a time. When stream=True is set on the request, this avoids reading the content at once into memory for large responses.

(Note: not re-entrant safe)
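
Since the question mentions doing data mining and filtering on the rows, the lines could also be fed into Python's csv module for proper parsing. A minimal sketch with an in-memory sample (hypothetical data, standing in for the decoded lines a streamed response would yield — note that iter_lines() returns bytes, which would need decoding first):

```python
import csv
import io

# Hypothetical in-memory sample standing in for the downloaded file,
# including a blank line to show the filtering
data = "a,1\nb,2\n\nc,3\n"

# csv.reader yields an empty list for a blank line, so filter those out,
# mirroring the blank-line check in the counting loop above
rows = [row for row in csv.reader(io.StringIO(data)) if row]
print(len(rows))  # 3
```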

Rolf of Saxony
  • Tried your solution with the requests package installed; it gave me a syntax error. Just got it fixed with `print(len(r.content.split()))` – Alex Ljamin Feb 17 '16 at 09:19
  • You are using Python3 then! – Rolf of Saxony Feb 17 '16 at 09:22
  • Updated tags accordingly. Your solution works great! Btw, I don't know who is constantly downvoting every of my questions. – Alex Ljamin Feb 17 '16 at 09:24
  • This answer was downvoted because the use of `split()` will fail whenever there is whitespace in the data. – mhawke Feb 17 '16 at 20:52
  • @RolfofSaxony: perhaps a better way yet is to use `r.iter_lines()`. If combined with setting `stream=True` when making the request memory usage is very low. Anyway, I've removed the downvote. You should probably also filter out any blank lines too. – mhawke Feb 18 '16 at 10:21
  • @mhawke Added your suggestion which is much cleaner – Rolf of Saxony Feb 18 '16 at 15:16
  • @RolfofSaxony: even better: `sum(1 for line in r.iter_lines() if line.strip())`. – mhawke Feb 18 '16 at 20:42