0

I am working on a pandas dataframe, business, which was created by reading a csv file in Python. The csv file contains data from the website Yelp.com. One of the columns called attributes has a nested data structure. The length of rows in these columns vary. The following will hopefully make that clear:

business.attributes[0]
>"BikeParking: True,BusinessAcceptsBitcoin: False,BusinessAcceptsCreditCards: True,BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False},DogsAllowed: False,RestaurantsPriceRange2: 2,WheelchairAccessible: True"

business.attributes[1]
>'BusinessAcceptsBitcoin: False,BusinessAcceptsCreditCards: True'

business.attributes[2]
>"Alcohol: none,Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': False},BikeParking: True,BusinessAcceptsCreditCards: True,BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False},Caters: True,GoodForKids: True,GoodForMeal: {'dessert': False, 'latenight': False, 'lunch': False, 'dinner': False, 'breakfast': False, 'brunch': False},HasTV: True,NoiseLevel: quiet,OutdoorSeating: False,RestaurantsAttire: casual,RestaurantsDelivery: True,RestaurantsGoodForGroups: True,RestaurantsPriceRange2: 1,RestaurantsReservations: False,RestaurantsTableService: False,RestaurantsTakeOut: True,WiFi: free"

I would like to insert multiple columns in the business dataframe and capture values under each attribute in a separate column. Capturing each row in the attribute column gives me a string, but I am struggling to split it because of various types of delimiters and the nested structure. I know that there are 34 different attributes in the dataset with 6 of them having nested values. A sample of an expected output for third row in the above example is in the image below:

Sample output

Any suggestions in Python or R are welcome to do this task. Please help!!

Rnovice
  • 333
  • 1
  • 5
  • 18
  • The data is almost a json. Maybe you could pre-process the file to make it json and then it would be easy to create all the columns navigating through the json structure. – ednincer Aug 10 '17 at 18:44
  • @ednincer I actually got the csv file by from a json file using a code available in Github to convert json to csv. I thought once it is in csv format, it will be easier to manipulate data in Python pandas. I don't know how to navigate through the json structure. – Rnovice Aug 10 '17 at 19:30
  • Then you should use the json file directly. Use the json library to decode the json file and you would get the data structure in an array. Check this: https://stackoverflow.com/questions/7771011/parse-json-in-python – ednincer Aug 10 '17 at 19:37

0 Answers0