0

I have a DataFrame like this:

| json_col                                           |
| ---------------------------------------------------|
| {"category":"a","items":["a","b","c","d","e","f"]} |
| {"category":"b","items":["u","v","w","x","y"]}     |
| {"category":"c","items":["p","q"]}                 |
| {"category":"d","items":["m"]}                     |

I converted it to a single string of dicts:

x = pd.Series(', '.join(df_list['json_col'].to_list()), name='text')

The result looks like this:

'{"category":"a","items":["a","b","c","d","e","f"]},
{"category":"b","items":["u","v","w","x","y"]},
{"category":"c","items":["p","q"]},
{"category":"d","items":["m"]}'

(EDIT: This was my original input when I posted the question, but it was pointed out to me that this is not valid JSON, so I am providing the DataFrame above.)

I need to write a Python function that takes an item as input and returns the top 3 items from the list it belongs to (excluding itself). Items are listed in priority order, so the top 3 are the first 3 items.

def item_list(above_json_input, item = "a"):
    return list

For example, the result list should follow these rules:

  1. If the item is "a", look in category a, where item "a" is present, and return the top 3 items in sequence: ["b","c","d"]
  2. If the item is "w", look in category b, where item "w" is present, and return ["u","v","x"]
  3. If the item is "q", look in category c, where item "q" is present, and return ["p"], because there are fewer than 3 other items besides "q"
  4. If the item is "m", look in category d, where item "m" is present, and return an empty list [], because there are no other items in that list.

The same goes for an item that does not exist, such as item = "r", which is not in any category: we can raise an error or return an empty list.

I am not sure how to read the json and get the list of top items. Is this even possible?

trojan horse
  • Does your JSON file contain a string or a dictionary? – omar Jul 21 '22 at 00:23
  • It is a string of dicts; each dict has values that are lists, which I need to search. – trojan horse Jul 21 '22 at 00:29
  • Break up into two parts: dealing with JSON and then applying the rules. First thing is starting off with valid JSON -- if it's supposed to be a list of categories, it's missing surrounding `[`/`]`, then use the `json` package in stdlib to parse the string. For the second part, please ask a more specific question. What have you tried, and what specific error are you blocked on? – Kache Jul 21 '22 at 00:48
  • Actually the JSONs I shared are records in each row of the dataframe. Like a column named - "json" with 4 rows with each row one category and items I shared. Can we use it with the dataframe directly? – trojan horse Jul 21 '22 at 01:08

2 Answers

1

I fixed your JSON, as it was badly formatted. For input "c", ['a', 'b', 'd'] and ['p', 'q'] are printed:

import json

data_string = """{
        "data" : [
                {"category":"a","items":["a","b","c","d","e","f"]},
                {"category":"b","items":["u","v","w","x","y"]},
                {"category":"c","items":["p","q"]},
                {"category":"d","items":["m"]}
        ]
}"""

data = json.loads(data_string)["data"]

user_input = input("Pick a letter: ")

found = False
for values in data:
        if user_input in (values["category"], *values["items"]):
                found = True
                temp = [item for item in values["items"] if item != user_input]
                print(temp[:3])

if not found:
        print([])
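
If a function is preferred over printing, as the `item_list` stub in the question suggests, the same idea might be sketched as below. This is only an illustration under a few assumptions: it matches on item names only (not category names, unlike the loop above), it returns an empty list for unknown items, and the `data` and `top_n` parameters are hypothetical names not taken from the question.

```python
import json

data_string = """[
    {"category":"a","items":["a","b","c","d","e","f"]},
    {"category":"b","items":["u","v","w","x","y"]},
    {"category":"c","items":["p","q"]},
    {"category":"d","items":["m"]}
]"""


def item_list(data, item="a", top_n=3):
    """Return up to top_n other items from the first category containing item."""
    for record in data:
        if item in record["items"]:
            # Keep the original priority order, skipping the item itself.
            return [other for other in record["items"] if other != item][:top_n]
    # Assumption: an unknown item yields an empty list rather than an error.
    return []


data = json.loads(data_string)
print(item_list(data, "a"))  # ['b', 'c', 'd']
print(item_list(data, "w"))  # ['u', 'v', 'x']
print(item_list(data, "q"))  # ['p']
print(item_list(data, "m"))  # []
print(item_list(data, "r"))  # []
```

Filtering into a new list and then slicing handles lists of any length, so no separate branches for "more than 3" or "fewer than 3" items are needed.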
Jonathan Ciapetti
  • The JSON is similar to what I shared, but only because I built it from a DataFrame by combining all rows. Can this be done on a DataFrame where each row is a JSON record, e.g. row 1 = {"category":"a","items":["a","b","c","d","e","f"]}, row 2 = {"category":"b","items":["u","v","w","x","y"]}, ...? I want to read that column, check those records, and return the list. Did I make a mistake by joining all the rows into a single string of JSON objects? – trojan horse Jul 21 '22 at 01:12
  • I have now added the original input as a DataFrame above as well – trojan horse Jul 21 '22 at 01:15
  • can you check the new inputs – trojan horse Jul 21 '22 at 01:20
  • Well, I read "JSON input" in the question so I treated like a JSON, and [here](https://json.org/example.html) you can see examples of JSONs. If you get your data from a Pandas DataFrame, sure you can still get the same results, but it would be a different question, I'm not being harsh, I just think that it would be not ok with the rules of SO, and others can answer that better than me. Btw it would be just a matter of using the DataFrame data instead of the JSON one. – Jonathan Ciapetti Jul 21 '22 at 01:29
  • @trojanhorse Also, when you edit your question and people have already given some answers, I think it would be better to explicitly write that you edited, or to the mods those answers (like mine) will seem odd. – Jonathan Ciapetti Jul 21 '22 at 01:35
  • Apologies, I didn't mean to confuse. I was trying it as JSON based on another SO answer and got stuck, and then you pointed out that the JSON format was wrong. To make it clear: I edited the question, and I accept your answer because it works. I just need the same for a DataFrame. Can you suggest how to read the JSON rows and get the data part? I am getting an error for that – trojan horse Jul 21 '22 at 02:41
  • Thank you, it's ok, no big deal. I read [here](https://stackoverflow.com/questions/20037430/reading-multiple-json-records-into-a-pandas-dataframe) that you can do it this way: 1) delete the ',' at the end of each row in `data_string`, 2) use `data = pd.read_json(data_string, lines=True)` . From that point forward, it's just you and the DataFrame, see how other answers implement the algorithm with the DataFrame as input. – Jonathan Ciapetti Jul 21 '22 at 03:01
0

You could try this on your dataframe:

import pandas as pd

# Example frame using the first row from the question; each cell is a dict.
df = pd.DataFrame({'jsonCol': [{"category": "a", "items": ["a", "b", "c", "d", "e", "f"]}]})
h = df['jsonCol']


def search(inm):
    for item in h:
        if inm in item['items']:
            # Filter instead of pop() so the stored list is not mutated;
            # slicing also covers lists with exactly 3 items.
            return [i for i in item['items'] if i != inm][:3]
    return []
        
print(search('r'))

edit:

h = [{"category":"a","items":["a","b","c","d","e","f"]},{"category":"b","items":["u","v","w","x","y"]},{"category":"c","items":["p","q"]},{"category":"d","items":["m"]}]

def search(inm):
    for item in h:
        if inm in item['items']:
            # Filter instead of pop() so h is not mutated between calls;
            # slicing also covers lists with exactly 3 items.
            return [i for i in item['items'] if i != inm][:3]
    return []
        
print(search('b'))  # answer ['a', 'c', 'd']
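
For the DataFrame shown in the question, where each row of `json_col` holds one JSON string, a sketch along these lines may work. It assumes the cells are strings (hence the `json.loads` call per row) rather than already-parsed dicts, and `top_items` and `top_n` are hypothetical names used only for illustration.

```python
import json

import pandas as pd

# One JSON string per row, as in the question's DataFrame.
df = pd.DataFrame({
    "json_col": [
        '{"category":"a","items":["a","b","c","d","e","f"]}',
        '{"category":"b","items":["u","v","w","x","y"]}',
        '{"category":"c","items":["p","q"]}',
        '{"category":"d","items":["m"]}',
    ]
})


def top_items(frame, item, top_n=3):
    """Return up to top_n other items from the first row whose list contains item."""
    for raw in frame["json_col"]:
        record = json.loads(raw)  # parse this row's JSON string into a dict
        if item in record["items"]:
            return [other for other in record["items"] if other != item][:top_n]
    # Assumption: an item found in no row yields an empty list.
    return []


print(top_items(df, "w"))  # ['u', 'v', 'x']
print(top_items(df, "r"))  # []
```

Parsing row by row avoids joining the rows into one string, which is what produced the invalid JSON in the first place.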
omar