
I have a dataset with a variable, NAICS Industry, represented by a six-digit number. I want to narrow this number down to its first two digits so I can combine industries for a broader view. Once the industry code is reduced to two digits, I want to use value counts to count the total number of loans that fall within each NAICS industry code. Can someone please help? I have attached pictures for reference.

Reference of NAICS industry codes: as you can see, some of the codes share the same first two digits. I want to group these codes under one broader subgroup to get the total number of loans within that industry.

1 Answer


The best approach depends on the data type of the NAICS data (which I can't tell from the screenshot alone) and assumptions about the number of digits.

Assuming that the dataset contains only six-digit NAICS codes in integer format (that is, df['NAICS'].dtype is int64 or similar), the first two digits can be obtained by dividing the NAICS code by 10000 using integer division:

df['NAICS_sector'] = df['NAICS'] // 10000

Note that you must use // (integer division) and not / (floating-point division), which would return a fractional value such as 45.112 rather than 45.
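As a minimal sketch of the difference, assuming a small made-up dataframe (the sample codes below are illustrative and not taken from the question's data):

```python
import pandas as pd

# Hypothetical sample data; the real column may hold different codes.
df = pd.DataFrame({"NAICS": [451120, 453110, 722410, 811118]})

# Integer division drops the last four digits and keeps an integer dtype.
df["NAICS_sector"] = df["NAICS"] // 10000

# Floating-point division keeps the fractional part and returns floats.
as_float = df["NAICS"] / 10000

print(df["NAICS_sector"].tolist())  # [45, 45, 72, 81]
print(as_float.dtype)               # float64
```

With floats, codes such as 451120 and 453110 would no longer collapse into the same sector value, which defeats the purpose of the grouping.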

If the NAICS codes are in the dataframe in string format (that is, df['NAICS'].dtype says object), you can use string manipulation instead:

df['NAICS_sector'] = df['NAICS'].str.slice(stop=2)

Setting stop=2 means that the first two characters are returned from each entry. The parameters of the Series.str.slice method are explained in the official pandas documentation.
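For illustration, here is the string version on a made-up dataframe (again, the sample codes are assumptions, not the question's data). The indexing shorthand .str[:2] should give the same result as .str.slice(stop=2):

```python
import pandas as pd

# Hypothetical sample data stored as strings (object dtype).
df = pd.DataFrame({"NAICS": ["451120", "453110", "722410", "811118"]})

# Keep the first two characters of every code.
df["NAICS_sector"] = df["NAICS"].str.slice(stop=2)

# Shorthand that does the same thing via string indexing.
shorthand = df["NAICS"].str[:2]

print(df["NAICS_sector"].tolist())  # ['45', '45', '72', '81']
```

Note that the resulting sector column holds strings, so '45' here is not equal to the integer 45 produced by the integer-division approach.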

Finally, if your dataset contains integers but you cannot guarantee they all have the same number of digits, you'll want to use string manipulation anyway: convert the column to strings with astype(str) and then apply the string slicing approach.
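A rough sketch of that mixed-length case, assuming an integer column with no missing values and some made-up codes of different lengths:

```python
import pandas as pd

# Hypothetical integer codes with two, four, and six digits.
df = pd.DataFrame({"NAICS": [451120, 4511, 72, 811118]})

# Integer division by 10000 only works for six-digit codes.
by_division = df["NAICS"] // 10000

# Converting to strings first makes the slice independent of the length.
df["NAICS_sector"] = df["NAICS"].astype(str).str.slice(stop=2)

print(by_division.tolist())         # [45, 0, 0, 81]
print(df["NAICS_sector"].tolist())  # ['45', '45', '72', '81']
```

If integer sectors are preferred, the result can be converted back with .astype(int), provided every entry has at least one digit.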

After all this is done, you can group or count using the new NAICS_sector column, for example by calling value_counts() on it.
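To tie this back to the question, a minimal sketch of counting loans per sector might look like the following. It assumes each row of the dataframe is one loan and uses made-up six-digit integer codes:

```python
import pandas as pd

# Hypothetical data: one row per loan, six-digit integer NAICS codes.
df = pd.DataFrame({"NAICS": [451120, 453110, 722410, 811118, 722511]})
df["NAICS_sector"] = df["NAICS"] // 10000

# Number of loans in each two-digit sector, most frequent first.
loans_per_sector = df["NAICS_sector"].value_counts()

# The same counts via groupby, ordered by sector code instead.
loans_grouped = df.groupby("NAICS_sector").size()

print(loans_per_sector)
print(loans_grouped)
```

value_counts() is the shorter option for a plain count; groupby is more convenient if other columns (for instance a loan amount, if the dataset has one) need to be aggregated per sector as well.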

nanofarad
    Thank you so much; I hope to be just as great as you one day!! You solved this frustrating problem for me in no time; I appreciate it. Have a great day! :) Btw, the data type was int64, so the first solution worked great. Thanks again. – DopeNAnalytical Aug 21 '22 at 17:08