So far, I have the following piece:

local socket = require "socket.http"
client,r,c,h = socket.request{url = "http://example.com/", proxy="<my proxy and port here>"}
for i,v in pairs( c ) do
  print( i, v )
end

which gives me an output like the following:

connection  close
content-type    text/html; charset=UTF-8
location    http://www.iana.org/domains/example/
vary    Accept-Encoding
date    Tue, 24 Apr 2012 21:43:19 GMT
last-modified   Wed, 09 Feb 2011 17:13:15 GMT
transfer-encoding   chunked
server  Apache/2.2.3 (CentOS)

which means that the connection was established perfectly. Now I want to fetch the titles of my URLs using socket.http. I searched previous SO questions and LuaSocket's HTTP documentation, but I still have no idea how to fetch and store the whole page, or part of it, in a variable and do something with it.

Please help.

hjpotter92

1 Answer

You are using the 'generic' form of http.request(), which requires storing the body via an LTN12 sink. It's not as complicated as it sounds. Try this code:

local http = require "socket.http"
local ltn12 = require "ltn12" -- LTN12 library, provided by LuaSocket

-- This table will collect the body (possibly in multiple chunks):
local result_table = {}
-- On success the generic form returns 1, the status code,
-- the response headers and the status line:
local ok, code, headers, status = http.request{
    url = "http://example.com/",
    sink = ltn12.sink.table(result_table),
    proxy = "<my proxy and port here>"
}
-- Join the chunks together into a string:
local result = table.concat(result_table)
-- Hacky solution to extract the title:
local title = result:match("<[Tt][Ii][Tt][Ll][Ee]>([^<]*)<")
print(title)

If your proxy is constant throughout your application then a more straightforward solution would be to use the simple form of http.request(), and specify the proxy via http.PROXY:

local http = require "socket.http"
http.PROXY="<my proxy and port here>"

local result = http.request("http://www.youtube.com/watch?v=_eT40eV7OiI")
local title = result:match("<[Tt][Ii][Tt][Ll][Ee]>([^<]*)<");
print(title);

Output:

    Flanders and Swann - A song of the weather
  - YouTube
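
The title in the output above still carries the whitespace and line break from the page source, and titles can also contain HTML entities. A minimal sketch of one way to tidy it up follows. `normalize_title` and the `named_entities` table are hypothetical names made up for this example, and only a handful of common entities plus decimal ASCII references are decoded; anything else is left untouched.

```lua
-- Hypothetical helper, not part of LuaSocket: tidy a raw <title> capture.

-- A deliberately small set of named entities (an assumption; extend as needed).
local named_entities = { amp = "&", lt = "<", gt = ">", quot = '"', apos = "'" }

-- Return the replacement for one entity name, or nil to leave it unchanged.
local function decode_entity(name)
  if name:sub(1, 1) == "#" then
    -- Decimal references only, and only within the ASCII range.
    local code = tonumber(name:sub(2))
    if code and code < 128 then
      return string.char(code)
    end
    return nil
  end
  return named_entities[name]
end

local function normalize_title(raw)
  if not raw then return nil end
  -- Decode entities in a single pass so "&amp;lt;" becomes "&lt;", not "<".
  local s = raw:gsub("&(#?%w+);", decode_entity)
  -- Collapse every run of whitespace (including newlines) into one space.
  s = s:gsub("%s+", " ")
  -- Trim leading and trailing whitespace.
  return (s:match("^%s*(.-)%s*$"))
end
```

With this in place, `print(normalize_title(title))` prints the title on a single line.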
MattJ
  • Thanks! This works great in general with all kinds of pages. :) But, on trying to fetch the title of youtube links, the `result` variable has only the [**404 error**](http://www.hastebin.com/gikavorone.xml) page in it. I tried both the methods. The second one fetches the pages quicker. :) – hjpotter92 Apr 25 '12 at 03:33
  • I just updated with an example YouTube link and the output I get. It all works fine for me. The title has whitespace padding in, and probably HTML entities sometimes too. You'll probably want to normalize it a bit by stripping and converting those. – MattJ Apr 25 '12 at 04:16
  • Nope, it didn't work yet. I am running the file (named `02.lua`) in SciTE. Here's a screenshot of the output and the code (I used 4 different web pages, 2 of them on my own web server). Check: http://i.stack.imgur.com/XkQQj.jpg – hjpotter92 Apr 25 '12 at 04:36
  • Interesting. I can only guess it's something to do with your proxy, as that's the only difference between your code and mine. To debug something like this I would usually reach for Wireshark, and log the request and response to see if there's anything unexpected. Did you say it works with some pages but not all? – MattJ Apr 25 '12 at 12:27
  • Here is another screenshot with a sample of some websites. http://i.stack.imgur.com/SC7jw.jpg As you can see, `youtube.com` is redirected to **google** and `stackoverflow` is not even being opened, other than that, the luasocket's official page is returning **404 error**. – hjpotter92 Apr 26 '12 at 07:52
  • Ok, based on that screenshot I've fathomed what is going on. Your proxy is sending the Host header it receives instead of the correct one based on the destination URL. You might be able to use the advanced version of http.request to set a custom host header, but I haven't tested it. – MattJ Apr 27 '12 at 22:30
  • Thanks a lot. :) I tried using custom headers, and 404 seems no longer the response. But now, the links are being redirected. Please check: http://stackoverflow.com/q/10360632/1190388 – hjpotter92 Apr 28 '12 at 04:39
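
The Host header workaround discussed in the comments could look roughly like the sketch below. It relies on the `headers` field that the generic form of http.request accepts. `host_of`, `request_with_host` and `fetch` are hypothetical helper names made up for this example, the URL pattern assumes a plain absolute URL without user credentials, and whether a given proxy honours the header is untested here.

```lua
-- Sketch only: these helpers are illustrative, not part of LuaSocket.

-- Pull the host (without port) out of an absolute URL such as http://host/path.
local function host_of(url)
  return url:match("^%a[%w+.-]*://([^/:?#]+)")
end

-- Build the table for the generic form of http.request, forcing the
-- Host header to match the destination URL.
local function request_with_host(url, proxy, sink)
  return {
    url = url,
    proxy = proxy,
    sink = sink,
    headers = { ["Host"] = host_of(url) },
  }
end

-- Fetch a page through the proxy. Returns the body and the status code,
-- or nil and an error message if the request itself failed.
local function fetch(url, proxy)
  local http = require "socket.http"
  local ltn12 = require "ltn12"
  local chunks = {}
  local ok, code = http.request(
    request_with_host(url, proxy, ltn12.sink.table(chunks)))
  if not ok then
    return nil, code
  end
  return table.concat(chunks), code
end
```

Calling `fetch("http://www.youtube.com/watch?v=_eT40eV7OiI", "<my proxy and port here>")` would then send `Host: www.youtube.com` with the request. Checking the returned status code before matching the title helps tell a 404 or a redirect apart from a real page.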