1

I have a huge CSV file online and I want to read it line by line without downloading it. The file is behind a proxy. I wrote this code:

import io

import pandas as pd
import requests
from requests_ntlm import HttpNtlmAuth  # needed for the NTLM auth below

cafile = 'mycert.crt'

proxies = {"http":"http://ipproxy:port", "https":"http://ipproxy:port"}
auth = HttpNtlmAuth('Username','Password')
url = 'http://myurl/ressources.csv'

content = requests.get(url, proxies=proxies, auth=auth, verify=cafile).content
csv_read = pd.read_csv(io.StringIO(content.decode('utf-8')))
pattern = 'mypattern'

for row in csv_read.itertuples(index=False):  # iterate rows (iterating the DataFrame directly yields column names)
    if row[0] == pattern:
        print(row)
        break

This code works, but the `content = requests.get(...)` line takes a very long time because of the size of the CSV file.

So my question is: is it possible to read an online CSV line by line through a proxy?

Ideally, I would like to read the first row, check whether it matches my pattern, break if it does, and otherwise read the next line, and so on.

Thanks for your help

Jeremy
  • Does this answer your question? [How to read a CSV file from a URL with Python?](https://stackoverflow.com/questions/16283799/how-to-read-a-csv-file-from-a-url-with-python) – αԋɱҽԃ αмєяιcαη Mar 10 '20 at 13:17

3 Answers

1

You can pass stream=True to requests.get to avoid fetching the entire result immediately. In that case you can access a pseudo-file object through response.raw and build your CSV reader on top of that (alternatively, the response object has iter_content and iter_lines methods, but I don't know how easy it is to feed those to a CSV parser).
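A minimal sketch of that csv-module route, with the streaming part shown but commented out (the URL and pattern are placeholders taken from the question):

```python
import csv
import io

import requests


def find_first_match(fileobj, pattern):
    """Scan a text file-like object as CSV; return the first row whose
    first field equals `pattern`, or None. Stops reading on the first hit."""
    for row in csv.reader(fileobj):
        if row and row[0] == pattern:
            return row
    return None


# Streaming usage (placeholder URL):
# resp = requests.get("http://myurl/ressources.csv", stream=True)
# resp.raw.decode_content = True  # transparently decompress if gzip'd
# match = find_first_match(io.TextIOWrapper(resp.raw, encoding="utf-8"),
#                          "mypattern")
```

Because `csv.reader` pulls from the wrapped socket lazily, only as much of the file as needed to reach the matching row is downloaded. If the file is semicolon-separated, pass `delimiter=';'` to `csv.reader`.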

However, while the stdlib's csv module simply yields a sequence of lists or dicts and can therefore easily be lazy, pandas returns a DataFrame, which is not lazy; you need to pass the `chunksize` parameter to `read_csv`, and you then get an iterator of DataFrames, one per chunk.
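A minimal sketch of the chunked-pandas route, using the real `chunksize` parameter of `pd.read_csv` (the helper name is illustrative, and the CSV is assumed to have no header row):

```python
import io

import pandas as pd


def scan_chunks(fileobj, pattern, chunksize=10_000):
    """Read a headerless CSV in chunks of `chunksize` rows; return the
    first row whose first column equals `pattern`, or None."""
    for chunk in pd.read_csv(fileobj, chunksize=chunksize, header=None):
        hits = chunk[chunk.iloc[:, 0] == pattern]
        if not hits.empty:
            return hits.iloc[0].tolist()
    return None
```

When fed `response.raw` from a streamed request, each chunk is parsed as it arrives, so the scan can stop before the whole file is downloaded.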

Masklinn
  • Works like a charm! The `stream=True` option divides the request time by 4! I've posted my code below in case other users need it. Thanks! – Jeremy Mar 10 '20 at 15:02
0

The requests.get call will get you the whole file anyway. With a plain HTTP GET, you'd need to implement your own HTTP code, down to the socket level, to be able to process the content as it arrives.

The only way to get partial results and slice the download is to add an HTTP "Range" request header, if the server providing the file supports it (requests lets you set these headers).
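A minimal sketch of such a request (the URL is a placeholder, and whether this works depends entirely on the server honouring the header):

```python
import requests


def range_headers(start, length):
    """Build a Range header asking for `length` bytes starting at `start`.
    Byte ranges are inclusive on both ends."""
    return {"Range": f"bytes={start}-{start + length - 1}"}


# Hypothetical usage: a server that honours the header answers
# 206 Partial Content with only the requested slice; one that
# ignores it answers 200 with the full body.
# resp = requests.get("http://myurl/ressources.csv",
#                     headers=range_headers(0, 1024))
# partial = resp.content if resp.status_code == 206 else None
```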

Enter requests' advanced usage:

The good news is that requests can do this for you under the hood: pass the stream=True parameter when calling requests, and it will even let you iterate over the contents line by line. Check the documentation on that part.

Here is more or less what requests does under the hood so that you can get your contents line by line:

It will fetch reasonably sized chunks of your data, but certainly not request one line at a time (think ~80 bytes versus 100,000 bytes), because otherwise it would need a new HTTP request for each line, and the overhead of each request is not trivial, even over the same TCP connection.

Anyway, since CSV is a text format, neither requests nor any other software can know the size of the lines, much less the exact size of the "next" line to be read, before setting the range headers accordingly.

So, for this to work, there would have to be Python code to:

  • accept a request for a "new line" of the CSV and, if there are buffered text lines, yield the next one,
  • otherwise make an HTTP request for the next 100 KB or so,
  • concatenate the downloaded data to the remainder of the last downloaded line,
  • split the downloaded data at the last line feed in the binary data,
  • save the remainder of the last line,
  • convert the binary buffer to text (you'd have to take care of character boundaries in a multi-byte encoding like UTF-8, but cutting at newlines may spare you that),
  • yield the next text line
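The steps above can be sketched as a generator, assuming a hypothetical `fetch_range(offset, size)` helper that returns up to `size` raw bytes starting at `offset` (e.g. via a Range request) and `b""` at end of file:

```python
def iter_remote_lines(fetch_range, chunk_size=100_000, encoding="utf-8"):
    """Yield text lines from a remote file, downloading `chunk_size`
    bytes at a time via the caller-supplied `fetch_range` helper."""
    offset = 0
    remainder = b""  # bytes after the last newline seen so far
    while True:
        chunk = fetch_range(offset, chunk_size)
        if not chunk:
            break
        offset += len(chunk)
        buffer = remainder + chunk  # prepend the leftover partial line
        head, sep, remainder = buffer.rpartition(b"\n")
        if not sep:  # no newline in this chunk; keep buffering
            remainder = buffer
            continue
        # Splitting at newlines also sidesteps multi-byte character
        # boundaries, since a newline never falls inside a UTF-8 sequence.
        for line in head.split(b"\n"):
            yield line.decode(encoding)
    if remainder:  # trailing line without a final newline
        yield remainder.decode(encoding)
```

A caller would wire `fetch_range` to an HTTP client and simply loop over the generator, breaking as soon as the pattern is found.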
jsbueno
  • Or you can just use Requests's `stream=True` mode and `resp.iter_lines()`: https://requests.readthedocs.io/en/master/user/advanced/#streaming-requests – AKX Mar 10 '20 at 13:32
  • I missed that - I updated the answer to offer this option. – jsbueno Mar 10 '20 at 14:52
0

Following Masklinn's answer, my code now looks like this:

import requests
from requests_ntlm import HttpNtlmAuth  # needed for the NTLM auth below

cafile = 'mycert.crt'
proxies = {"http":"http://ipproxy:port", "https":"http://ipproxy:port"}
auth = HttpNtlmAuth('Username','Password')
url = 'http://myurl/ressources.csv'
pattern = 'mypattern'

r = requests.get(url, stream=True, proxies=proxies, auth=auth, verify=cafile)
if r.encoding is None:
    r.encoding = 'ISO-8859-1'

for line in r.iter_lines(decode_unicode=True):
    if line.split(';')[0] == pattern:
        print(line)
        break
Jeremy