Is there anything terribly inefficient here? It seems like this process is taking way longer than it should. I am parsing many JSON files, each containing a JSON array of objects. Maybe someone with more experience could point out an error in this method of parsing the JSON into objects, thereby saving me a ton of time.

Also, memory usage slowly creeps upward MB by MB, sometimes causing an OutOfMemoryException.

public void Parse()
{
    using (BabysFirstsUsersDataEntities db = new BabysFirstsUsersDataEntities())
    {
        foreach (var path in Directory.EnumerateFiles(@"C:\examplepath\").OrderBy(f => f))
        {
            string jsonString = System.IO.File.ReadAllText(path);
            JToken tok = JObject.Parse(jsonString);
            Debug.WriteLine("path: " + path);

            foreach (var x in tok.First.First)
            {
                JsonUserImageDTO jdto = x.ToObject<JsonUserImageDTO>();
                UserImageList uil = jdto.ToDataModel();

                if (uil.REID != null)
                    db.UserImageLists.Add(uil);
            }
        }

        db.SaveChanges();
    }
}

An example of what the JSON in each .json file looks like is below. Note that there are around 1000 of these files and each can have thousands of such entries:

{
  "results": [
    {
      "ACL": {
        "asdf": {
          "read": true,
          "write": true
        },
        "role:admin": { "read": true }
      },
      "REID": "exampleID",
      "createdAt": "datetime-string",
      "email": "example",
      "objectId": "example",
      "updatedAt": "datetimestring",
      "urlCount": 1,
      "urlList": [ "exampleurl" ]
    },
    {
      "ACL": {
        "asdf": {
          "read": true,
          "write": true
        },
        "role:admin": { "read": true }
      },
      "REID": "exampleID",
      "createdAt": "datetime-string",
      "email": "example",
      "objectId": "example",
      "updatedAt": "datetimestring",
      "urlCount": 1,
      "urlList": [ "exampleurl" ]
    }
  ]
}
JakeD

  • 1) You should [load](http://www.newtonsoft.com/json/help/html/M_Newtonsoft_Json_Linq_JToken_Load.htm) directly from `Stream` rather than loading to a large intermediate string and then parsing the string, as is explained [here](http://www.newtonsoft.com/json/help/html/Performance.htm). 2) Why is `path.Contains("UserImageList")` inside the inner loop? Can't you skip the file entirely if that is not true? – dbc Feb 14 '17 at 18:37
  • I'll check out #1 and report back... #2 is referring to the name of the file; it is a leftover from my implementation to handle different file types corresponding to different models. I'll change it now as it makes things seem a little more complicated than they are – JakeD Feb 14 '17 at 18:45
  • Like this? `StreamReader sr = new StreamReader(path); JsonReader reader = new JsonTextReader(sr); var tok = JObject.Load(reader);` – JakeD Feb 14 '17 at 18:59
  • Have you actually profiled your code? See Eric Lippert's [performance rant](https://ericlippert.com/2012/12/17/performance-rant/): **Use a profiler or other analysis tool to determine empirically where the bottleneck is before you start investigating alternatives.** – dbc Feb 14 '17 at 19:04
  • How many files are there and do you really need the orderby? That orderby will cause the system to enumerate all files in the folder before starting the loop. – Matthew Whited Feb 14 '17 at 19:05
  • @Matthew Whited The orderby is for my own use to see how far along in the files I get in a certain amount of time. There are around 1000 files and each one contains many entries. I am talking about reducing the run time by hours, so I don't think enumerating once is really a big deal here – JakeD Feb 14 '17 at 19:08
  • @dbc noted... thanks, I will check that out and will profile this to see what is taking the longest. – JakeD Feb 14 '17 at 19:08
  • Since your code currently works, this question might be more appropriate for http://codereview.stackexchange.com/ – dbc Feb 14 '17 at 19:34

3 Answers

It looks like there are several places that could be causing the slowness.

  1. Deserializing JSON
  2. Transforming the object twice (jdto, then uil)
  3. Saving to the database

It may be worth profiling the code to find out exactly which part is taking longer than you'd expect. That said, there are some things you can do to generally improve this code.

  1. Deserialize from a stream instead of a string. The way you have it, you basically have the object in memory twice: once as a string, then once as tok. See the second example in the docs for how to use a stream. Actually, in your case you have the same information in memory four times: the string, tok, jdto, and uil. Which brings me to the next point...
  2. Try to eliminate some of the intermediate representations of your object. Generally, the more objects you have lying around, the more time you will spend waiting on the GC.
  3. Move the filtering on the path name to the part where you call EnumerateFiles(). There is no sense in deserializing a file if you are not going to do anything with it. A sketch combining all three points follows this list.
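
For example, here is a minimal sketch of all three points together. It is a guess rather than the original code: it assumes a RootObjectDTO wrapper with a results collection (such a class appears in another answer below) and borrows the "UserImageList" file-name filter mentioned in the comments:

using (var db = new BabysFirstsUsersDataEntities())
{
    // Point 3: filter file names up front instead of checking inside the loop.
    foreach (var path in Directory.EnumerateFiles(@"C:\examplepath\", "*UserImageList*"))
    {
        // Point 1: deserialize straight from the file stream, so the whole
        // file never sits in memory as one big string.
        using (var sr = new StreamReader(path))
        using (var reader = new JsonTextReader(sr))
        {
            // Point 2: go directly to the DTOs, skipping the JToken tree.
            var root = JsonSerializer.CreateDefault().Deserialize<RootObjectDTO>(reader);

            foreach (var jdto in root.results)
            {
                var uil = jdto.ToDataModel();
                if (uil.REID != null)
                    db.UserImageLists.Add(uil);
            }
        }
    }
    db.SaveChanges();
}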
Mike Hixson

Have you actually profiled your code? See Eric Lippert's performance rant: Use a profiler or other analysis tool to determine empirically where the bottleneck is before you start investigating alternatives. For instance, your actual performance problem may be somewhere in the BabysFirstsUsersDataEntities db class.

That being said, my immediate reaction is that you have too many intermediate representations of your data, the construction, population and garbage collection of which all take time. These include:

  • The jsonString, which may be large enough to go on the large object heap and thus permanently impair the performance and memory use of your process.
  • The JToken tok representation of your entire JSON hierarchy.
  • Each individual JsonUserImageDTO.

What I would suggest is to eliminate as many of these intermediate representations as possible. As suggested in the documentation, you should load directly from a stream rather than loading to a string and parsing that string.
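
A minimal sketch of that first step might look like the following (this still builds the JToken tree, which the next steps then eliminate):

using (var stream = File.Open(path, FileMode.Open))
using (var reader = new StreamReader(stream))
using (var jsonReader = new JsonTextReader(reader))
{
    // The token tree is built directly from the stream;
    // the large intermediate string never exists.
    JToken tok = JToken.Load(jsonReader);
}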

You can also eliminate the JToken tok by populating your data model directly. Let's say your BabysFirstsUsersDataEntities looks like this (I'm just guessing here):

public class BabysFirstsUsersDataEntities
{
    public BabysFirstsUsersDataEntities() { this.UserImageLists = new List<UserImageList>(); }

    public List<UserImageList> UserImageLists { get; set; }
}

public class UserImageList
{
    public string email { get; set; }
    public List<string> urlList;
}

And your DTO model looks something like this model provided by http://json2csharp.com/:

public class RootObjectDTO
{
    public ICollection<JsonUserImageDTO> results { get; set; }
}

public class JsonUserImageDTO
{
    public ACL ACL { get; set; }
    public string REID { get; set; }
    public string createdAt { get; set; }
    public string email { get; set; }
    public string objectId { get; set; }
    public string updatedAt { get; set; }
    public int urlCount { get; set; }
    public List<string> urlList { get; set; }

    public UserImageList ToDataModel()
    {
        return new UserImageList { email = email, urlList = urlList };
    }
}

public class Asdf
{
    public bool read { get; set; }
    public bool write { get; set; }
}

public class RoleAdmin
{
    public bool read { get; set; }
}

public class ACL
{
    public Asdf asdf { get; set; }

    [JsonProperty("role:admin")]
    public RoleAdmin RoleAdmin { get; set; }
}

Then create the following generic ConvertingCollection<TIn, TOut> utility class:

public class ConvertingCollection<TIn, TOut> : BaseConvertingCollection<TIn, TOut, ICollection<TIn>>
{
    readonly Func<TOut, TIn> toInner;

    public ConvertingCollection(Func<ICollection<TIn>> getCollection, Func<TIn, TOut> toOuter, Func<TOut, TIn> toInner)
        : base(getCollection, toOuter)
    {
        if (toInner == null)
            throw new ArgumentNullException();
        this.toInner = toInner;
    }

    protected TIn ToInner(TOut outer) { return toInner(outer); }

    public override void Add(TOut item)
    {
        Collection.Add(ToInner(item));
    }

    public override void Clear()
    {
        Collection.Clear();
    }

    public override bool IsReadOnly { get { return Collection.IsReadOnly; } }

    public override bool Remove(TOut item)
    {
        return Collection.Remove(ToInner(item));
    }

    public override bool Contains(TOut item)
    {
        return Collection.Contains(ToInner(item));
    }
}

public abstract class BaseConvertingCollection<TIn, TOut, TCollection> : ICollection<TOut>
    where TCollection : ICollection<TIn>
{
    readonly Func<TCollection> getCollection;
    readonly Func<TIn, TOut> toOuter;

    public BaseConvertingCollection(Func<TCollection> getCollection, Func<TIn, TOut> toOuter)
    {
        if (getCollection == null || toOuter == null)
            throw new ArgumentNullException();
        this.getCollection = getCollection;
        this.toOuter = toOuter;
    }

    protected TCollection Collection { get { return getCollection(); } }

    protected TOut ToOuter(TIn inner) { return toOuter(inner); }

    #region ICollection<TOut> Members

    public abstract void Add(TOut item);

    public abstract void Clear();

    public virtual bool Contains(TOut item)
    {
        var comparer = EqualityComparer<TOut>.Default;
        foreach (var member in Collection)
            if (comparer.Equals(item, ToOuter(member)))
                return true;
        return false;
    }

    public void CopyTo(TOut[] array, int arrayIndex)
    {
        foreach (var item in this)
            array[arrayIndex++] = item;
    }

    public int Count { get { return Collection.Count; } }

    public abstract bool IsReadOnly { get; }

    public abstract bool Remove(TOut item);

    #endregion

    #region IEnumerable<TOut> Members

    public IEnumerator<TOut> GetEnumerator()
    {
        foreach (var item in Collection)
            yield return ToOuter(item);
    }

    #endregion

    #region IEnumerable Members

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    #endregion
}

You can now populate your db directly as follows:

var rootDTO = new RootObjectDTO
{
    results = new ConvertingCollection<UserImageList, JsonUserImageDTO>(
        () => db.UserImageLists,
        (x) => { throw new NotImplementedException(); },
        (x) => x.ToDataModel())
};

using (var stream = File.Open(path, FileMode.Open))
using (var reader = new StreamReader(stream))
{
    JsonSerializer.CreateDefault().Populate(reader, rootDTO);
}

By populating a preallocated rootDTO and ConvertingCollection<UserImageList, JsonUserImageDTO>, your db.UserImageLists will get populated with the contents of the JSON with fewer intermediate representations.

dbc

You can define classes that match the JSON structure and then deserialize directly into them. Example:

JsonConvert.DeserializeObject<RootObject>(jsonString);

public class Asdf
{
    public bool read { get; set; }
    public bool write { get; set; }
}

public class RoleAdmin
{
    public bool read { get; set; }
}

public class ACL
{
    public Asdf asdf { get; set; }

    [JsonProperty("role:admin")]
    public RoleAdmin RoleAdmin { get; set; }
}

public class Result
{
    public ACL ACL { get; set; }
    public string REID { get; set; }
    public string createdAt { get; set; }
    public string email { get; set; }
    public string objectId { get; set; }
    public string updatedAt { get; set; }
    public int urlCount { get; set; }
    public List<string> urlList { get; set; }
}

public class RootObject
{
    public List<Result> results { get; set; }
}
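
For instance, a rough sketch of how this could replace the JToken handling in the original loop. The mapping into the question's UserImageList entity is left as a comment, since that model isn't shown in full, and ReadAllText is kept for brevity (the other answers show how to read from a stream instead):

foreach (var path in Directory.EnumerateFiles(@"C:\examplepath\"))
{
    string jsonString = File.ReadAllText(path);
    RootObject root = JsonConvert.DeserializeObject<RootObject>(jsonString);

    foreach (Result r in root.results)
    {
        // Map each Result onto the data model here, e.g. copy r.REID,
        // r.email and r.urlList into a UserImageList and add it to the db.
    }
}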