
I'm trying to get the data from an API URL that returns at least 3 GB of data (this is enterprise scale) and insert it into a table in SQL Server. I'm getting the error "Insufficient memory to continue the execution of the program". I'm aware that a string can hold at most about 2 GB of data, which is why neither of the following will work:

string data = client.DownloadString(siteUrl);
string data = readStream.ReadToEnd();

We came up with the idea of adding a query parameter that filters the data by region, but this will also fail if a certain region's data exceeds 2 GB. So is there any other way I can get the whole chunk of data from the API URL?

Edit: What we currently have stores the data in a string object in C#:

string data = client.DownloadString(siteUrl);

Then I deserialize the data so I can filter out the element that I need in the next step:

var parsed = JsonConvert.DeserializeObject<Dictionary<string, object>>(data);

Since the API does not return just a plain list of records, I need to define a startElement that specifies where the data should be taken from:

List<Dictionary<string, string>> arrayOfData =
    (parsed[startElement] as Newtonsoft.Json.Linq.JArray)
        .ToObject<List<Dictionary<string, string>>>();

I iterate through arrayOfData to match each record against the SQL column mapping, store the result in a list, and that list is then inserted into the database by a SQL provider. Below is a sample data structure (the actual data is confidential):

{"href" : "someUrl","limit" : "1","records" : [{"columnA":"1","columnB":"2","columnC":"3"}]}

The problem is that I cannot do this processing because 3 GB of data is too much for a string to hold, so I'm asking whether there is any other way to store it (I know I can save it to a file, but the shape of the data makes it difficult to insert as-is).
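
For context, a minimal consolidated sketch of the approach described above (siteUrl and startElement are placeholders, and the column mapping and insert are stubbed out); it fails once the response approaches the 2 GB string limit:

using System.Collections.Generic;
using System.Net;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

string siteUrl = "https://example.com/api/data";    // placeholder API URL
string startElement = "records";                    // placeholder element name

using (var client = new WebClient())
{
    // The entire ~3 GB response is buffered into one string here, which is
    // what triggers "Insufficient memory to continue the execution of the program".
    string data = client.DownloadString(siteUrl);

    var parsed = JsonConvert.DeserializeObject<Dictionary<string, object>>(data);
    var arrayOfData = (parsed[startElement] as JArray)
        .ToObject<List<Dictionary<string, string>>>();

    foreach (var record in arrayOfData)
    {
        // map record["columnA"], record["columnB"], ... to SQL columns and insert (stubbed)
    }
}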

Chester Lim
  • You need to work with streams. Don't call `ReadToEnd`. – NtFreX Jun 30 '17 at 14:45
  • @NtFreX But then how can I parse the data to insert it into a table? – Chester Lim Jun 30 '17 at 14:50
  • Do you have control over the API? – Travis Acton Jun 30 '17 at 14:52
  • Relational databases are meant to store relational data; that is what they excel at. What they do poorly is storing large unstructured data sets. You should consider a different approach to storing 3GB of unstructured data. A good and commonly used alternative is to store the data on the file system and store a pointer to that file in the database. If you must store the data in the database, consider using [FileStream](https://www.simple-talk.com/sql/learn-sql-server/an-introduction-to-sql-server-filestream/). – Igor Jun 30 '17 at 14:52
  • Possible duplicate of [Can I use a stream to INSERT or UPDATE a row in SQL Server (C#)?](https://stackoverflow.com/questions/2459762/can-i-use-a-stream-to-insert-or-update-a-row-in-sql-server-c) – Igor Jun 30 '17 at 14:56
  • @TravisActon Hello, unfortunately I don't, and the data still needs to be deserialized before I can store it in the database. This is where I'm having trouble. – Chester Lim Jun 30 '17 at 15:00
  • @ChesterLim - Your question is rather unclear. Do you have problems because you want to store the data directly in the DB? Will the data be stored in an unstructured format in the DB? Why do you have to read it into memory all at once? If you do, what is the structure of that data and where is it coming from? What do you have to do *exactly* to the data in memory that you are having this issue to begin with? – Igor Jun 30 '17 at 15:02
  • Are you running on a 64- or 32-bit OS? What version of .NET are you using? – Travis Acton Jun 30 '17 at 15:07
  • @Igor Sorry, my question was too vague. I just added an edit section to the question. – Chester Lim Jun 30 '17 at 15:26
  • @TravisActon 64-bit, .NET 4.5, and VS2017, if this helps. – Chester Lim Jun 30 '17 at 15:28
  • What is the API from which you are getting the data? Can you append parameters to achieve retrieval by paging/chunking by specifying a page size and page number? – Igor Jun 30 '17 at 15:29
  • @Igor I don't have access to change the API since it is being maintained by another team. That's what we are proposing, but it will take time and we need to gather the data as soon as possible. If there's no other way then I guess I'll just have to wait. – Chester Lim Jun 30 '17 at 15:31
  • Is the resulting structure an array, and is the structure fixed? You could read in chunks from the stream until (*using a regex*) a delimiter is reached (*like `},`, the end of an object in the array*), and deserialize and process 1 (or a fixed number of) objects at a time before reading more (see the sketch after these comments). That would be the only other way to go about *reading* from the stream. You did not make clear what you will be doing with the result. – Igor Jun 30 '17 at 15:35
  • I see. Let me try that out! Thank you! – Chester Lim Jun 30 '17 at 15:45
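
One way to realize that suggestion without hand-rolling the delimiter scan is Newtonsoft.Json's streaming JsonTextReader, which reads the response and deserializes one record at a time. A rough sketch, assuming the records sit under a "records" array as in the sample structure above (siteUrl is a placeholder; the insert is stubbed):

using System.Collections.Generic;
using System.IO;
using System.Net;
using Newtonsoft.Json;

string siteUrl = "https://example.com/api/data";    // placeholder, as in the question

using (var client = new WebClient())
using (var stream = client.OpenRead(siteUrl))        // response is streamed, never fully buffered
using (var sr = new StreamReader(stream))
using (var reader = new JsonTextReader(sr))
{
    var serializer = new JsonSerializer();
    while (reader.Read())
    {
        // position on the "records" array, then deserialize one element at a time
        if (reader.TokenType == JsonToken.PropertyName && (string)reader.Value == "records")
        {
            reader.Read();                            // advance onto the StartArray token
            while (reader.Read() && reader.TokenType == JsonToken.StartObject)
            {
                var record = serializer.Deserialize<Dictionary<string, string>>(reader);
                // map record to SQL columns and insert (stubbed)
            }
            break;                                    // finished the records array
        }
    }
}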

3 Answers


Write to a file instead, e.g.:

using (var file = File.OpenWrite(@"c:\temp\foo.dat"))
{
    readStream.CopyTo(file);
}
David Browne - Microsoft
  • Sorry, I just edited my question. I am aware that I can save the data to a file, but my problem is how to parse the data and insert it into a table in SQL Server, since I cannot store it in an object in C#. – Chester Lim Jun 30 '17 at 14:48
  • Just start reading from the stream, parsing records and loading SQL Server. You'll want to use SqlBulkCopy to load that much data (see the sketch after these comments). – David Browne - Microsoft Jun 30 '17 at 14:51
  • Thank you for this, but I don't think this will work for my issue. I added an edit section to my question. – Chester Lim Jun 30 '17 at 15:32
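
A minimal sketch of that SqlBulkCopy suggestion, assuming the records have already been parsed one at a time from the downloaded file (ParseRecords is a hypothetical streaming parser; the column names come from the sample structure in the question, and the connection string and table name are placeholders):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Buffer parsed records into a DataTable and flush to SQL Server in batches
// so memory stays flat regardless of how large the source file is.
var table = new DataTable();
table.Columns.Add("columnA");
table.Columns.Add("columnB");
table.Columns.Add("columnC");

foreach (Dictionary<string, string> record in ParseRecords("data.json"))   // hypothetical streaming parser
{
    table.Rows.Add(record["columnA"], record["columnB"], record["columnC"]);

    if (table.Rows.Count == 10000)              // flush every 10,000 rows
    {
        BulkInsert(table);
        table.Clear();
    }
}
if (table.Rows.Count > 0) BulkInsert(table);    // flush the remainder

void BulkInsert(DataTable batch)
{
    // placeholder connection string and destination table name;
    // add bulk.ColumnMappings if the destination column order differs
    using (var bulk = new SqlBulkCopy("Server=.;Database=MyDb;Integrated Security=true"))
    {
        bulk.DestinationTableName = "dbo.MyTable";
        bulk.WriteToServer(batch);
    }
}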

If you can run on a 64-bit OS and use .NET 4.5 or higher, then this application setting will lift the 2 GB memory limit:

<runtime>
    <gcAllowVeryLargeObjects enabled="true" />    
</runtime>

In an ideal world the API should be changed to implement paging if it has the potential to return that much data.
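
If paging ever becomes available, the client-side loop is straightforward. A rough sketch, assuming hypothetical page and pageSize query parameters and the same records structure as in the question (siteUrl is a placeholder; the insert is stubbed):

using System.Collections.Generic;
using System.Net;
using Newtonsoft.Json.Linq;

string siteUrl = "https://example.com/api/data";     // placeholder, as in the question

using (var client = new WebClient())
{
    int page = 0;
    const int pageSize = 10000;                       // hypothetical page size

    while (true)
    {
        // "page" and "pageSize" are hypothetical parameter names the API would have to support
        string json = client.DownloadString($"{siteUrl}?page={page}&pageSize={pageSize}");

        var records = JObject.Parse(json)["records"] as JArray;
        if (records == null || records.Count == 0) break;   // no more pages

        foreach (var record in records)
        {
            // map to SQL columns and insert (stubbed)
        }
        page++;
    }
}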

Travis Acton

As suggested by Igor in the comments:

You could read in chunks from the stream until (using a regex) a delimiter is reached (like }, the end of an object in the array), and deserialize and process 1 (or a fixed number of) objects at a time before reading more.

So I did, and it worked! For anyone interested, here are the steps:

  • Instead of storing a big chunk of data in a string with client.DownloadString(siteUrl), the data was downloaded to a file with client.DownloadFile(siteUrl, "data.json").
  • Used a StreamReader to read each character from the file: using (StreamReader sr = new StreamReader(@"E:\Debug\data.json")).
  • Created a method that reads each character from the downloaded file and begins storing characters once a "[" is encountered.
  • Stored each character in a List<char> and created a new string with new String(chars.ToArray()) whenever a "}" was hit - note that one "{}" block is equivalent to one valid record.
  • Deserialized the string with JsonConvert.DeserializeObject<Dictionary<string, string>>(record).
  • Mapped the data to each column in the SQL Server table and inserted the record.
  • Cleared the List<char> after inserting the data, then processed the next record.

With this process, only one record at a time is kept in memory. It works, but it takes too long to finish. Let me know if there is a better way to handle this!
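
Putting those steps together, a rough sketch of the loop (the file path and column names come from the steps and the sample structure above; the SQL insert itself is stubbed, and the simple brace scan assumes each record is a flat object with no braces inside string values):

using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

var chars = new List<char>();
bool insideArray = false;

using (var sr = new StreamReader(@"E:\Debug\data.json"))    // path from the steps above
{
    int c;
    while ((c = sr.Read()) != -1)                           // one character at a time
    {
        char ch = (char)c;

        if (!insideArray)
        {
            if (ch == '[') insideArray = true;              // start collecting once the records array begins
            continue;
        }
        if (ch == ']') break;                               // end of the records array

        chars.Add(ch);
        if (ch == '}')                                      // one "{...}" block == one valid record
        {
            string json = new string(chars.ToArray()).TrimStart(',', ' ', '\r', '\n', '\t');
            var record = JsonConvert.DeserializeObject<Dictionary<string, string>>(json);

            // map record["columnA"], record["columnB"], ... to SQL columns and insert (stubbed)

            chars.Clear();                                  // reset for the next record
        }
    }
}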

Chester Lim