2

I have data in a pandas dataframe which looks like this:

queryName   Market  tags    categoryDetails
dummy_query (dummy_market)  dummy_market    dummy_tag   [{'name': 'relevant_data', 'parentName': 'relevant_scrape', 'parentId': '289245228', 'id': '2892695401'}, {'name': 'relevant_data', 'parentName': 'relevant_scrape', 'parentId': '289245228', 'id': '21892718'}, {'name': 'dummy_data', 'parentName': 'Location', 'parentId': '21221517840', 'id': '229565351'}]
dummy_query (dummy_market)  dummy_market    dummy_tag   [{'name': 'relevant_data', 'parentName': 'relevant_scrape', 'parentId': '289245228', 'id': '2892659'}, {'name': 'relevant_data', 'parentName': 'relevant_scrape', 'parentId': '289245228', 'id': '2892667'}, {'name': 'irrelevant_data', 'parentName': 'irrelevant_scrape', 'parentId': '2662610', 'id': '268415777'}, {'name': 'dummy_data', 'parentName': 'Location', 'parentId': '21221517840', 'id': '2565351'}]
dummy_query (dummy_market)  dummy_market    dummy_tag   [{'name': 'relevant_data', 'parentName': 'relevant_scrape', 'parentId': '289245228', 'id': '2892695401'}, {'name': 'irrelevant_data', 'parentName': 'irrelevant_scrape', 'parentId': '2662610', 'id': '268415777'}, {'name': 'dummy_data', 'parentName': 'Location', 'parentId': '21221517840', 'id': '229565351'}, {'name': 'Consideration', 'parentName': 'irrelevant_scrape', 'parentId': '2203873', 'id': '2203874'}]
dummy_query (dummy_market)  dummy_market    dummy_tag   [{'name': 'relevant_data', 'parentName': 'relevant_scrape', 'parentId': '289245228', 'id': '2892695401'}, {'name': 'irrelevant_data', 'parentName': 'irrelevant_scrape', 'parentId': '2662610', 'id': '268415777'}, {'name': 'dummy_data', 'parentName': 'Location', 'parentId': '21221517840', 'id': '229565351'}]
dummy_query (dummy_market)  dummy_market    dummy_tag   [{'name': 'relevant_data', 'parentName': 'relevant_scrape', 'parentId': '289245228', 'id': '21892718'}, {'name': 'irrelevant_data', 'parentName': 'irrelevant_scrape', 'parentId': '2662610', 'id': '268415777'}, {'name': 'dummy_data', 'parentName': 'Location', 'parentId': '21221517840', 'id': '229565351'}]
dummy_query (dummy_market)  dummy_market    dummy_tag   [{'name': 'relevant_data', 'parentName': 'relevant_scrape', 'parentId': '289245228', 'id': '2892659'}, {'name': 'dummy_data', 'parentName': 'Location', 'parentId': '21221517840', 'id': '229565351'}, {'name': 'dummy_data', 'parentName': 'irrelevant_scrape', 'parentId': '2203873', 'id': '2203880'}]

I need my dataframe to have an additional, fifth column which will contain all the name keys with the elements called "relevant_data" of each row. These datapoints are selected based off of the parentName. If parentName = 'relevant_scrape', select "name."

How should I go about doing this? Here is my code so far.

import pandas as pd
import json
from pandas import DataFrame, read_csv

df = pd.read_csv('dataset.csv', sep = '\t')
for row in df.categoryDetails:
    if isinstance(row, str):
        list_dicts = json.loads(row.replace("'", "\""))
        for each_dict in list_dicts:
            if each_dict["parentName"] == "relevant_scrape":
                df['fifth_column'] = each_dict["name"]

df.to_csv('output.txt', sep = '\t')

(Note: my original data is a bit messy and couldn't be rendered as JSON until I replaced its quotation marks with double quotation marks. Hence the json.loads call.)

This produces for me a dataframe with a fifth column, but it inserts the same exact "name" element in each row. Any and all help is appreciated, thank you.

dataelephant
  • 563
  • 2
  • 7
  • 21

1 Answers1

1

You are using df['fifth_column'] = each_dict["name"], which sets all values in the 'fifth_column' column to the same value each iteration, since pandas' operations are column wise by default.

Maybe you should try the following snippet:

def extract_details(row):
    # your parsing logic.
    if isinstance(row, str):
        list_dicts = json.loads(row.replace("'", "\""))
        all_relevant_data = []
        for each_dict in list_dicts:
            if each_dict["parentName"] == "relevant_scrape":
                all_relevant_data.append(each_dict["name"])
        return ','.join(all_relevant_data)

and then you could do:

df['fifth_column'] = df.categoryDetails.apply(extract_details)
OmerBA
  • 792
  • 8
  • 13
  • I got: "TypeError: 'str' object does not support item assignment" when I replaced column with row. – dataelephant Jul 07 '16 at 19:59
  • This executed, but all the data in that column (fifth column) is being duplicated from the first piece of "relevant_data." Any ideas? – dataelephant Jul 07 '16 at 20:35
  • 1
    This is what the code you posted does. all the first dicts have name=relevant_data in the data. – OmerBA Jul 07 '16 at 20:53
  • I had to change the data for privacy reasons. Each "relevant_data" point can be one of ten different strings. I need the code to apply each snagged data point onto the fifth column and also only to those rows which have that data- right now this code actually appends it to every single row. Thank you again! Upvoting your answer and comments. – dataelephant Jul 07 '16 at 20:55
  • 1
    No problem :). if you want all the 'relevant_data' fields you might want to keep a list (and append relevant data from each dict to it) and then return the list in the inner function. i'll edit the answer again – OmerBA Jul 07 '16 at 20:58
  • Worked like a charm. Thank you. – dataelephant Jul 08 '16 at 14:39
  • Actually....what if I wanted, instead of having two "relevant_data" fields in the fifth column separated by a comma, to have only one in that column per row and then duplicate the row so that the second "relevant_data" point was in the fifth column cell for that row? – dataelephant Jul 08 '16 at 15:32
  • 1
    take a look at http://stackoverflow.com/questions/29058147/flattening-an-array-in-pandas?noredirect=1&lq=1 – OmerBA Jul 08 '16 at 18:49