
I am trying to follow the advice from curl error 18 - transfer closed with outstanding read data remaining.

The top answer is to

...let curl set the length by itself.

I don't know how to do this. I have tried the following:

curl --ignore-content-length http://corpus-db.org/api/author/Dickens,%20Charles/fulltext

However, I still get this error:

curl: (18) transfer closed with outstanding read data remaining
BSMP
  • Probably a server compression-size bug where the server sends the wrong Content-Length header if the client doesn't add an `Accept-Encoding` header; I've seen it before. Try `curl --compressed http://corpus-db.org/api/author/Dickens,%20Charles/fulltext` - and if I turn out to be right, you should contact the server owners and let them know about the bug in their server. – hanshenrik Mar 23 '19 at 11:40
  • The API returns a `SyntaxError: JSON.parse: unterminated string at line 1 column 26636273 of the JSON data`. Not sure if you can do anything about that. Just visit the page in a browser (I tried Mozilla): http://corpus-db.org/api/author/Dickens,%20Charles/fulltext – Mr.Turtle Mar 26 '19 at 09:20
  • It works fine on my machine; maybe it's a connection issue. – Bhupesh lad Apr 01 '19 at 00:35

2 Answers


The connection is simply being closed by the server after about 30 seconds. You can try to increase the client's download speed, but if the server doesn't deliver the complete response within that time limit, you will get this message even with a fast connection.

In the case of the example http://corpus-db.org/api/author/Dickens,%20Charles/fulltext I received a larger amount of content when writing directly to standard output:

curl http://corpus-db.org/api/author/Dickens,%20Charles/fulltext

while the amount was smaller when writing to a file (around 47 MB received in the 30 seconds):

curl -o Dickens,%20Charles http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
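
To see for yourself how many bytes arrive and how long the transfer runs before the server cuts it off, curl's --write-out option can report both (a small sketch; the output file name out.json is just a placeholder):

curl -o out.json -w 'downloaded %{size_download} bytes in %{time_total}s\n' http://corpus-db.org/api/author/Dickens,%20Charles/fulltext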

Resuming the transfer can be tried, but the example server doesn't support it:

curl -C - -o Dickens,%20Charles http://corpus-db.org/api/author/Dickens,%20Charles/fulltext

curl: (33) HTTP server doesn't seem to support byte ranges. Cannot resume.

So there may be options to optimize the request, to increase the connection speed, or to increase the cache size, but if you have reached the server's limit and never receive all the data within the allowed time, there is nothing you can do on the client side.
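
One option worth trying, as already suggested in the comments above, is to request a compressed response so that fewer bytes have to cross the wire within the time limit; whether it helps depends on the server actually compressing this endpoint:

curl --compressed -o Dickens.json http://corpus-db.org/api/author/Dickens,%20Charles/fulltext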

The curl manual can be found here: https://curl.haxx.se/docs/manual.html

The following links won't help you but perhaps are interesting:
The repository for the data-server can be found here: https://github.com/JonathanReeve/corpus-db
The documentation for the used web-server can be found here: https://hackage.haskell.org/package/warp-3.2.13

David
  • `already ~47MB in 30 seconds` - your internet connection is too slow; this is what I get from a 1 Gbit connection: ```root@x:~# time curl http://corpus-db.org/api/author/Dickens,%20Charles/fulltext > out.txt % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 50.7M 0 50.7M 0 0 5478k 0 --:--:-- 0:00:09 --:--:-- 5933k real 0m9.515s user 0m0.024s sys 0m0.112s root@x:~# root@x:~# du --bytes out.txt 53222321 out.txt ``` - it's 53.2 megabytes – hanshenrik Apr 01 '19 at 14:41
  • That's right, I never got the whole data and always got the error message. – David Apr 01 '19 at 18:57
  • It would also be interesting whether this: `curl http://corpus-db.org/api/author/Dickens,%20Charles/fulltext > out.txt` transfers about the same amount as this: `curl -o out.txt http://corpus-db.org/api/author/Dickens,%20Charles/fulltext`. So that site is a good benchmark, I'm just worried that they perhaps don't like it :/ – David Apr 01 '19 at 19:06
  • It seems I now got it completely once with each notation, but I see differences in each execution, so it's not a reliable benchmark, or it just shows that the connection speed is not always the same. – David Apr 01 '19 at 19:20

It's a speed issue. The server at corpus-db.org will DISCONNECT YOU if you take longer than 35 seconds to download something, regardless of how much you've already downloaded.

To make matters worse, the server does not support Content-Range, so you can't download it in chunks and simply resume download where you left off.

To make matters even worse, not only is Content-Range not supported, but it's SILENTLY IGNORED, which means it seems to work, until you actually inspect what you've downloaded.
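
You can see this for yourself by sending an explicit range request and inspecting the response headers: a server that honours ranges answers with 206 Partial Content and only the requested bytes, whereas a server that ignores the range (as described above) just answers 200 OK and streams the whole body. A sketch using standard curl options (-r asks for a byte range, -D - dumps the received headers to stdout):

curl -s -D - -o /dev/null -r 0-1023 http://corpus-db.org/api/author/Dickens,%20Charles/fulltext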

If you need to download that page from a slower connection, I recommend renting a cheap VPS, setting it up as a mirror of whatever you need to download, and downloading from your mirror instead. Your mirror does not need to have the 35-second limit.

For example, this VPS costs $1.25/month, has a 1 Gbps connection, and would be able to download that page. Rent one of those, install nginx on it, wget the page into nginx's www folder, and download it from your mirror; you'll then have 300 seconds (nginx's default timeout) instead of 35 seconds to download it. If 300 seconds is not enough, you can even change the timeout to whatever you want.
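
A rough sketch of that mirror setup on a Debian/Ubuntu VPS (the file name dickens.json, the docroot /var/www/html, and your-vps-ip are placeholders, not anything the server prescribes):

sudo apt-get install -y nginx
sudo wget -O /var/www/html/dickens.json 'http://corpus-db.org/api/author/Dickens,%20Charles/fulltext'

and then, from the slow connection:

curl -o dickens.json http://your-vps-ip/dickens.json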

Or you could even get fancy and set up a caching proxy compatible with curl's --proxy parameter, so your command could become

curl --proxy=http://yourserver http://corpus-db.org/api/author/Dickens,%20Charles/fulltext

If someone is interested in an example implementation of this, let me know.
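
As one possible implementation, a caching forward proxy such as Squid could fill that role. A minimal sketch of /etc/squid/squid.conf, with the object-size limit raised so the roughly 53 MB response can be cached (http_access allow all is only for a quick private test; restrict it to your own IP in practice):

http_port 3128
maximum_object_size 100 MB
cache_dir ufs /var/spool/squid 1024 16 256
http_access allow all

after which the command from above would look something like:

curl --proxy http://yourserver:3128 http://corpus-db.org/api/author/Dickens,%20Charles/fulltext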

You can't download that page over a 4 Mbit connection because the server will kick you off before the download is complete (after 35 seconds), but if you download it over a 1000 Mbit connection, you'll be able to fetch the entire file before the timeout kicks in.

(My home internet connection is 4 Mbit, and I can't download it from home, but I tried downloading it from a server with a 1000 Mbit connection, and that works fine.)
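
For a rough sense of the speed required, take the 53,222,321-byte size reported in the comments on the other answer and the 35-second window:

echo $(( 53222321 / 35 ))   # ≈ 1520637 bytes/s, i.e. about 1.5 MB/s or roughly 12 Mbit/s

so a 4 Mbit/s line only manages around a third of the file before the server disconnects, while a 1 Gbps line finishes with plenty of margin.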

PS: I'm not associated with ramnode in any way, except that I'm a (former) happy customer of theirs, and I recommend them to anyone looking for cheap, reliable VPSs.

hanshenrik