0

working on a database file and have exported a part of it for illustration. One of the column has very lengthy text data which needs to be grouped into different categories for further analysis. i have looked through code blocks here and here as well but not the solution i was looking for. The text column has values such as:

  1. First example of text
{
   "23":{
      "ClassifierName":"CP!nativeads!PA!pconv!0604-0701.ini",
      "CountModelIds":"1190"
   },
   "61":{
      "ClassifierName":"CP!nativeads!roas!pconv!PA!0601-0628!remove_fea.ini",
      "CountModelIds":"1185;1186"
   },
   "63":{
      "ClassifierName":"CP!nativeads!roas!gv!v2!PA!NN!0601-0628.ini",
      "CountModelIds":"1187;1188;1189"
   },
   "65":{
      "ClassifierName":"CP!en-us!Cascade!PA!ROAS_ScorePrediction0628Update.ini"
   }
}
  1. Another example of text
{"23":{"ClassifierName":"CP!nativeads!PA!pconv!0604-0701.ini","CountModelIds":"1190"},"61":{"ClassifierName":"CP!nativeads!roas_pconv!PA!NN!4weektill20200928.ini","CountModelIds":"1178;1179"},"63":{"ClassifierName":"CP!nativeads!roas_gv!PA!NN!withTA4weeksTill20200929.ini","CountModelIds":"1180;1181;1182"},"65":{"ClassifierName":"ROASScorePrediction.ini"}}

Below is what I have tried, as I wasn't able to think through. this was a very rudimentary approach where I was adding new columns based on each match however I want to add one new column and it should have sub categories or groups in this new column. Please note this entire string needs to be matched and a portion.

import pandas as pd
df= pd.read_excel('./sourcefile - 2023-03-10T161929.976.xlsx')
df['Textcolumn_name'].unique().size
df['mask1'] = (df['Textcolumn_name'].str.contains('{"23":{"ClassifierName":"CP!nativeads!PA!pconv!0604-0701.ini","CountModelIds":"1190"},"61":{"ClassifierName":"CP!nativeads!roas!pconv!PA!0601-0628!remove_fea.ini","CountModelIds":"1185;1186"},"63":{"ClassifierName":"CP!nativeads!roas!gv!v2!PA!NN!0601-0628.ini","CountModelIds":"1187;1188;1189"},"65":{"ClassifierName":"CP!en-us!Cascade!PA!ROAS_ScorePrediction0628Update.ini"}}',case=False,na=False))
df['mask2'] = (df['Textcolumn_name'].str.contains('{"23":{"ClassifierName":"CP!nativeads!PA!pconv!0604-0701.ini","CountModelIds":"1190"},"61":{"ClassifierName":"CP!nativeads!roas_pconv!v2!PA!NN!OfferSelectionMask!roasonly4weektill20210707.ini","CountModelIds":"1185;1186"},"63":{"ClassifierName":"CP!nativeads!roas_gv!v2!PA!NN!withTA4weeksTill20210707.ini","CountModelIds":"1187;1188;1189"},"65":{"ClassifierName":"ROASScorePrediction.ini"}}', case= False, na=False))

With this 2 new columns are created and where there is an exact match it returns true/false as Boolean operator.

I need if text is as shown in example 1 then Cat= 1 or A whereas if text is as shown in example 2 then cat= 2 or B. Like this there are 6-8 type of different text types.

Shivi Bhatia
  • 157
  • 4
  • 22

0 Answers0