I am attempting to convert a webpage from a format I don't understand to ASCII so I can search it for certain data. I retrieve the page with WebClient using the page's URL, then use Encoding to convert the data from what I think is Unicode to ASCII, but the output doesn't change at all. Below is my code:

WebClient web = new WebClient();
string page = "https://www.myurl.com/";

Stream data = web.OpenRead(page);
StreamReader reader1 = new StreamReader(data);
string input = reader1.ReadToEnd();
Encoding unicode = Encoding.Unicode;
Encoding ascii = Encoding.ASCII;

string webpage = ascii.GetString(
  Encoding.Convert(unicode, ascii, unicode.GetBytes(input))
);

Below is what the webpage data looks like. It is identical to the input data, which suggests my conversion didn't work.

     \"sprited\":true,\"spriteCssClass\":\"sx_a11c08\",\"spriteMapCssClass\":\"sp_SN-oNOqlzVS\"},\"505789\":{\"sprited\":true,\"spriteCssClass\":\"sx_5219b1\",\"spriteMapCssClass\":\"sp_SN-oNOqlzVS\"},\"505782\":{\"sprited\":true,\"spriteCssClass\":\"sx_c0671f\",\"spriteMapCssClass\":\"sp_SN-oNOqlzVS\"},\"505794\":{\"sprited\":true,\"spriteCssClass\":\"sx_8cf344\",\"spriteMapCssClass\":\"sp_SN-oNOqlzVS\"},\"495429\": 

Does anyone know what kind of data this is and how to convert it into something I can understand? When I view the page source in the browser, none of this data shows up. In other words, the data I get from WebClient looks nothing like the page source in the browser.

AKX
Dave
  • That looks like partial JSON with backslashes escaped. If possible, can you provide the actual URL you're trying to access? – AKX Jan 25 '19 at 16:48
  • I don't think you have a problem with character encoding; it's not an issue with how the characters are represented as bytes. Your problem is that you are expecting HTML and you are getting something else, which looks a bit like JSON. – Jodrell Jan 25 '19 at 16:52
  • If it's encoded in UTF8, the ASCII range will look the same. – Heretic Monkey Jan 25 '19 at 16:52
  • I do not see anything that looks like unicode. What are you trying to change? – jdweng Jan 25 '19 at 16:52
  • I am accessing my Facebook page with my user id, which I don't really want to show. It's https://www.facebook.com/my id. Some of the data does look like web data, but not all. – Dave Jan 25 '19 at 16:53
  • The site may be using the fact that your web client isn't presenting itself as a browser with the `User-Agent` as an incentive to present its data as JSON, rather than rendering it in HTML. Alternatively, the "source" you're inspecting may be a DOM tree already modified by the site executing JavaScript, which your download wouldn't run. Try something like Fiddler to see what's actually going over the wire. – Jeroen Mostert Jan 25 '19 at 16:53
  • What [`Accept-Encoding`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding) header are you sending with your request? If you are calling a Web API you are probably going to get JSON back by default. – Jodrell Jan 25 '19 at 16:55
  • It seems like I need to add the User-Agent type to my webclient. Does anyone know what user-agent value to use? – Dave Jan 25 '19 at 16:57
  • *webpage data looks like* looks like where? Because Visual Studio will escape `"` automatically. That does not mean that the `"` is escaped in the underlying data. To be clear there is no encoding issue here. That encoding is fine, the `\"` is an escape sequence to allow you to include `"` in strings, e.g. `string test = "this \"string\" is a string";` – Liam Jan 25 '19 at 16:57
  • If you are only looking for some data, why is the UTF-8 text format an obstacle? Web pages usually have their format specified in a header. You can check with this. – Yarl Jan 25 '19 at 17:02
  • Isn't it against Facebook's EULA to crawl their pages? – Lukazoid Jan 25 '19 at 17:12
  • I am using my Facebook page only for testing. Besides, what prevents Google from crawling webpages, including Facebook pages? – Dave Jan 25 '19 at 17:20
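
To illustrate the point made in the comments, here is a minimal sketch of why the conversion in the question appears to do nothing. The sample string is a hypothetical fragment modeled on the data in the question, and the class name is made up for the example. The `\"` sequences are C# source escapes (and are also how the Visual Studio debugger displays quotes); the string itself contains plain `"` characters.

```csharp
using System;
using System.Text;

class EncodingRoundTrip
{
    static void Main()
    {
        // Hypothetical sample modeled on the question's data.
        // The backslashes are C# source escapes, not part of the string.
        string input = "{\"sprited\":true,\"spriteCssClass\":\"sx_a11c08\"}";

        // Same conversion as in the question: UTF-16 bytes -> ASCII bytes -> string.
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(input);
        byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, utf16Bytes);
        string webpage = Encoding.ASCII.GetString(asciiBytes);

        // Every character is already in the ASCII range, so nothing changes.
        Console.WriteLine(webpage == input); // True
        Console.WriteLine(webpage);          // {"sprited":true,"spriteCssClass":"sx_a11c08"}
    }
}
```

Printing the string with `Console.WriteLine` shows the quotes without backslashes, which suggests the escaping seen in the question comes from how the value was displayed rather than from the data itself.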

2 Answers

Is that the full web page data? It looks incomplete on both ends. To me, it looks like JSON data. You can convert it into a C# object by using the JavaScriptSerializer class.

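using System.Web.Script.Serialization; // needs a reference to System.Web.Extensions

JavaScriptSerializer json_serializer = new JavaScriptSerializer();
// Deserialize<T> maps the JSON onto your own type. DeserializeObject returns a Dictionary<string, object>, which cannot be cast to Test.
Test resultingData = json_serializer.Deserialize<Test>(your_data);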

If you want to read JSON from a request, do it like this:

var json = web.DownloadString(page);

Then you need to deserialize the string into an object. If you know the type of the model in the response, you can do it like this; let's say it's ResponseType.

using Newtonsoft.Json;

...

var result = JsonConvert.DeserializeObject<ResponseType>(json);

There is a NuGet package called Facebook which you can import to your project. This will give you some models that might match up with the type.


If you don't know the type of the response, you could do something like this:

using Newtonsoft.Json.Linq;

...

var jObject = JObject.Parse(json);

Then you can use LINQ to query the object.
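
As a rough sketch of such a query, assuming the response is an object that maps ids to sprite descriptions like the fragment in the question (the JSON below is a hypothetical completed version of that fragment, and the class name is made up):

```csharp
using System;
using System.Linq;
using Newtonsoft.Json.Linq;

class QueryExample
{
    static void Main()
    {
        // Hypothetical, completed version of the fragment shown in the question.
        string json = @"{
            ""505789"": { ""sprited"": true, ""spriteCssClass"": ""sx_5219b1"", ""spriteMapCssClass"": ""sp_SN-oNOqlzVS"" },
            ""505782"": { ""sprited"": true, ""spriteCssClass"": ""sx_c0671f"", ""spriteMapCssClass"": ""sp_SN-oNOqlzVS"" }
        }";

        JObject jObject = JObject.Parse(json);

        // Each top-level property is an id mapped to a sprite description.
        var classes = jObject.Properties()
            .Where(p => (bool)p.Value["sprited"])
            .Select(p => new { Id = p.Name, CssClass = (string)p.Value["spriteCssClass"] });

        foreach (var item in classes)
            Console.WriteLine($"{item.Id}: {item.CssClass}");
    }
}
```

With real data the property names and nesting will likely differ, so inspect the parsed object first and adjust the keys.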

Jodrell
  • I added the user-agent "Mozilla/4.0" to my WebClient Headers, and now WebClient will not let me read my Facebook page, giving me an open error. If I remove the user-agent from the Headers and try to read the returned data stream with a JSON reader, I get an error saying it found an error at position 0. So it seems that it is impossible to correctly read a Facebook page. I tried using another webpage for my testing and everything came out the way I expected. – Dave Jan 25 '19 at 19:31
  • @Dave, is that a new question? My answer assumes you can get the JSON, like in the question. – Jodrell Jan 28 '19 at 09:25
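
A parse error at position 0 is what a JSON reader typically reports when the body is not JSON at all, for example an HTML page. One way to check, sketched below, is to set a `User-Agent` and then look at the `Content-Type` response header before choosing a parser. The URL is the placeholder from the question, the user-agent string is an assumed browser-like value, and the class name is made up; the site may still reject or alter the response.

```csharp
using System;
using System.Net;
using System.Text;

class FetchExample
{
    static void Main()
    {
        // Placeholder URL from the question.
        string page = "https://www.myurl.com/";

        using (var web = new WebClient())
        {
            web.Encoding = Encoding.UTF8;
            // Assumed browser-like value; not a guaranteed fix for any particular site.
            web.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

            string body = web.DownloadString(page);

            // Content-Type reports what the server says it sent.
            string contentType = web.ResponseHeaders[HttpResponseHeader.ContentType] ?? "";
            Console.WriteLine(contentType);

            bool looksLikeJson = contentType.IndexOf("json", StringComparison.OrdinalIgnoreCase) >= 0;
            Console.WriteLine(looksLikeJson ? "Parse with a JSON reader." : "Treat as HTML or text.");
        }
    }
}
```

If the header reports `text/html`, the JSON seen in the question is probably embedded inside script blocks in the page, and the body as a whole will not parse as JSON.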