2

I have a column with a bunch of ingredients lists in it. I'm trying to figure out how many times different individual ingredients appear. There are 73,000 rows. The answers on this question works for a small amount of data in Google Sheets.

Formula is =UNIQUE(TRANSPOSE(SPLIT(JOIN(", ";A2:A);", ";FALSE)))

But I've overwhelmed JOIN with more than 50000 characters here. Is there another way to tackle this?

Sheet: https://docs.google.com/spreadsheets/d/1t0P9hMmVpwhI2IbATmIMjobuALTg8VWhl8-AQaq3zIo/edit?usp=sharing

enter image description here

player0
  • 124,011
  • 12
  • 67
  • 124
Kyle Pennell
  • 5,747
  • 4
  • 52
  • 75

5 Answers5

1
=UNIQUE(TRANSPOSE(SPLIT(REGEXREPLACE(TRANSPOSE(
 QUERY(ARRAYFORMULA(","&A1:A),,5000000))," ,",","),",")))

0

player0
  • 124,011
  • 12
  • 67
  • 124
1

but maybe you need this (?):

=QUERY(TRANSPOSE(SPLIT(REGEXREPLACE(TRANSPOSE(
 QUERY(ARRAYFORMULA(","&A1:A),,5000000))," ,",","),",")),
 "select Col1,count(Col1) 
  where Col1 is not null 
  group by Col1 
  label count(Col1)''")

0

player0
  • 124,011
  • 12
  • 67
  • 124
1
=ARRAYFORMULA(UNIQUE(TRIM(TRANSPOSE(SPLIT(TRANSPOSE(
 QUERY(","&A1:A,,5000000)),",")))))

0


=QUERY(QUERY(ARRAYFORMULA(TRIM(TRANSPOSE(SPLIT(TRANSPOSE(
 QUERY(","&A1:A,,5000000)),",")))),
 "select Col1,count(Col1) 
  where Col1 is not null 
  group by Col1 
  label count(Col1)''"),
  "order by Col2 desc")

0

demo spreadsheet

player0
  • 124,011
  • 12
  • 67
  • 124
1

I did a google scripting solution because I wanted to play with key map pairs.

function myFunction() {
  var myMap = {"candy":0};
    var sh = SpreadsheetApp.getActiveSpreadsheet();
    var ss = sh.getSheetByName("FIRSTSHEETNAME");
    var os = sh.getSheetByName("Ingredients");
    var data = ss.getDataRange().getValues();
    for (var i=0; i<data.length;i++)//full
    //for (var i=1; i<4000;i++)//test
      {
      var array = data[i][0].split( ",");
      for (var j=0; j<array.length;j++)
        {
          var item = array[j];
          //Logger.log(array[j]);
          if (myMap[item]>-1){
            //Logger.log("REPEAT INGREDIENT");
            var num = parseInt(myMap[item]);
            num++;
            myMap[item]=num;
            //Logger.log(item +" "+num);
          } else {
            myMap[item]=1;
            //Logger.log("New Ingredient: "+item);
            //Logger.log(myMap);
          }

        }
      }
  //Logger.log(myMap);
  var output=[];
  for (var key in myMap){
    //Logger.log("Ack");
    output.push([key,myMap[key]]);


  }
  //Logger.log(output);
  os.getRange(2,1,output.length,output[0].length).setValues(output);

}

You'll need to add an "Ingredients" tab for the output and change your first tab to be called FIRSTSHEETNAME (or change the code). In my testing it took 4 seconds for 4 items, 5 seconds for 400 items, and 6 seconds for 4000 items. there might be an issue with leading spaces but this gives you a place to start.

J. G.
  • 1,922
  • 1
  • 11
  • 21
  • I just counted them... there are **1,965,918** items which could take more than 30 seconds (?) in which case script will fail before it's completed – player0 Mar 13 '19 at 23:37
  • Did you try it? The script limit is 5 minutes now for non corporate accounts, 30 minutes for corporate accounts. and by items I meant rows. – J. G. Mar 14 '19 at 16:21
  • I didn't, did you? 5/30min... interesting. I heard 30sec... good to know then. as for the rows, there are: **98,268** – player0 Mar 14 '19 at 17:39
1

A fast running formula that works with columns of at least 40,000 rows:

 =query(arrayformula(TRIM(flatten(split(A2:A20000,",")))),"select Col1,Count(Col1) Where NOT (Col1='' OR Col1 contains '#VALUE!') Group By Col1 order by Count(Col1) desc label Col1 'Ingredient',Count(Col1) 'Freq.'")

FLATTEN function, combined with SQL (QUERY function) can be a solution for fast filtering of values (such as empty or error messages).

TRIM function avoids artifacts in the result due to meaningless spaces before/after each string.

Sheet: https://docs.google.com/spreadsheets/d/1m9EvhQB1Leg2H7L52WhPe66_jRrTc8VsnZcliQsxJ7s/edit?usp=sharing

*In case of false case differences, you could normalize all characters of the strings to uppercase before within the same formula with UPPER(A2:A20000).

EBL
  • 71
  • 5