
I have a file that I want to parse. Its data is in JSON format, but the file itself is not a valid JSON file. I want to loop through the file and pull out the ID of every entry where totalReplyCount is greater than 0.

  {  "totalReplyCount": 0,
       "newLevel":{ 
           "main":{  
              "url":"http://www.someURL.com",
              "name":"Ronald Whitlock",
              "timestamp":"2016-07-26T01:22:03.000Z",
              "text":"something great"
              },
       "id":"z12wcjdxfqvhif5ee22ys5ejzva2j5zxh04"
    }
},
    {  "totalReplyCount": 4,
        "newLevel":{ 
           "main":{  
              "url":"http://www.someUR2L.com",
              "name":"other name",
              "timestamp":"2016-07-26T01:22:03.000Z",
              "text":"something else great"
             },
       "id":"kjsdbesd2wd2eedd23rf3r3r2e2dwe2edsd"
    }
},

My initial attempt was to do the following

def readCsv(filename):
    with open(filename, 'r') as csvFile:
        for row in csvFile["totalReplyCount"]:
            print row

but I get an error stating

TypeError: 'file' object has no attribute '__getitem__'

I know this is just an attempt at printing and not doing what I want to do, but I am a novice at python and lost as to what I am doing wrong. What is the correct way to do this? My end result should look like this for the ids:

['insdisndiwneien23e2es', 'lsndion2ei2esdsd',....]

EDIT 1- 7/26/16

I saw that I made a mistake in my formatting when I copied the code (it was late, I was tired...). I switched it to a proper format that is more like JSON. This new edit properly matches the file I am parsing. I then tried to parse it with the json module, and got ValueError: Extra data: line 2 column 1 - line X column 1, where line X is the last line of the file.

def readCsv(filename):
    with open(filename, 'r') as file:
        data = json.load(file)
        pprint(data)

I also tried DictReader, and got a KeyError: 'totalReplyCount'. Is the dictionary un-ordered?

EDIT 2 -7/27/16

After taking a break, coming back to it, and thinking it over, I realized that what I have (after proper massaging of the data) is a CSV file, that contains a proper JSON object on each line. So, I have to parse the CSV file, then parse each line which is a top level, whole and complete JSON object. The code I used to try and parse this is below but all I get is the first string character, an open curly brace '{' :

def readCsv(filename):
    with open(filename, 'r') as csvfile:
        for row in csv.DictReader(csvfile):
            for item in row:
                print item[0]

I am guessing that the DictReader is converting the json object to a string, and that is why I am only getting a curly brace as opposed to the first key. If I was to do print item[0:5] I would get a mish mash of the first 4 characters in an un-ordered fashion on each line, which I assume is because the format has turned into an un-ordered list? I think I understand my problem a little bit better, but still wrapping my head around the data structures and the methods used to parse them. What am I missing?

unseen_damage
  • You are trying to use `[]` on a file object, which doesn't support it. Also, the data you are reading does not look like a CSV. – Paul Rooney Jul 26 '16 at 04:07
  • How did this abomination of a file even come into existence? It's so broken, I'm not sure how you intend to parse it at all. – Aran-Fey Jul 26 '16 at 04:12
  • You only want to get the ids? Nothing else? – Aran-Fey Jul 26 '16 at 04:13
  • Are you sure the file isn't valid json? It looks like it might be, if you are only posting a part of it, are you? Whatever the case, the CSV module is definitely not what you are going to need. – juanpa.arrivillaga Jul 26 '16 at 04:19
  • @Rawing - yes this is an abomination. It is what it is. I only want the ID's where the totalReplyCount is greater than zero. – unseen_damage Jul 26 '16 at 14:09
  • @juanpa.arrivillaga - I am not using the CSV module, I just gave the file the variable name csvFile because the values I want to read are comma separated as shown. Each "object" is on its own line, I just expanded it for readability. – unseen_damage Jul 26 '16 at 14:09

5 Answers

Having read the question and the other answers, here is an approach that may be useful to you.

I treat the input as a plain text file, not as a CSV or JSON file.

The flow of the code is as follows:

  • Read the file and iterate over its lines in reverse order. Within each object the id comes after totalReplyCount, so reading backwards means the id is seen first.
  • Search each line for an id. When one is found, extract it and store it in a temporary variable.
  • Keep reading line by line and search for totalReplyCount.
  • When totalReplyCount is found, check whether it is greater than 0.
  • If it is, append the temporary id to id_list and reset the temporary variable.
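
import re

tmp_id_to_store = ''
id_list = []

# Read all lines up front so they can be walked from the end of the file.
with open("a.txt") as f:
    lines = f.readlines()

for line in reversed(lines):
    # Going backwards, an object's "id" line is seen before its count line.
    m = re.search(r'"id":"(\w+)"', line)
    if m:
        tmp_id_to_store = m.group(1)
    # \s* tolerates varying whitespace around the key and the value.
    n = re.search(r'\{\s*"totalReplyCount":\s*(\d+),', line)
    if n:
        if int(n.group(1)) > 0:
            id_list.append(tmp_id_to_store)
            tmp_id_to_store = ''

print(id_list)  # ids appear in reverse file order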

More checks can be added as needed.
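
If, as the asker says in the comments, each object really sits on a single line in the actual file, the same line-by-line idea could also be sketched with json.loads instead of regular expressions. This is only a sketch under that assumption, written for Python 3; the function name iter_reply_ids is made up for illustration, and the nesting of id under newLevel is taken from the sample in the question.

```python
import json


def iter_reply_ids(lines):
    """Yield ids from lines that each hold one JSON object.

    Assumes one complete object per line, optionally followed by a comma.
    """
    for line in lines:
        # Remove the newline and the trailing comma that separates objects.
        line = line.strip().rstrip(",")
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("totalReplyCount", 0) > 0:
            # In the question's sample, "id" is nested under "newLevel".
            yield record["newLevel"]["id"]
```

Since an open file is an iterable of lines, it can be passed in directly, for example list(iter_reply_ids(f)) inside a with open(...) as f: block. A line that is not a complete JSON object will make json.loads raise ValueError.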

Dinesh Pundkar

As the error says, your csvFile is a file object, not a dict, so you cannot index into it with [].

If your file really is in CSV format, you can use the csv module to read each line into a dict:

import csv
with open(filename) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['totalReplyCount'])

Note the DictReader class from the csv module: it takes the first line as the column names and parses each following line into a dict keyed by those names.
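
For data shaped like the sample in the question, that header behaviour is probably what produces the KeyError and the lone '{' the asker reports. The sketch below uses two made-up one-line objects (not the real file) and Python 3's io.StringIO to stand in for the file.

```python
import csv
import io

# Two made-up one-line objects in the comma-separated style from the question.
sample = (
    '{"totalReplyCount": 0, "newLevel": {"id": "abc"}},\n'
    '{"totalReplyCount": 4, "newLevel": {"id": "def"}},\n'
)

reader = csv.DictReader(io.StringIO(sample))

# DictReader consumes the first line as the header row, split on commas,
# so the column names are fragments of the first JSON object.
fieldnames = reader.fieldnames
print(fieldnames)

# No column is named exactly "totalReplyCount", hence the KeyError.
has_key = "totalReplyCount" in fieldnames

# Iterating over a row dict yields its keys, i.e. those header fragments,
# so key[0] is just the first character of each fragment.
first_row = next(reader)
first_chars = [key[0] for key in first_row if key]
print(first_chars)
```

So the output is not an unordered version of the JSON object; the csv module never parses JSON at all, it only splits text on commas.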

Chrim

If your input file is JSON, why not use the json library to parse it and then loop over the resulting data? Then it is just a matter of iterating over the keys and extracting the values.

import json
from pprint import pprint

with open('data.json') as data_file:    
    data = json.load(data_file)

pprint(data)

See the question "Parsing values from a JSON file using Python?" and look at Justin Peel's answer there. It should help.
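
For a file like the one in the question, where the objects are separated by commas but not enclosed in a list, json.load fails with "Extra data". One possible workaround, sketched here for Python 3 and assuming every object is itself valid JSON, is to wrap the text in square brackets before parsing. The function name ids_with_replies and the shortened sample string are made up for illustration.

```python
import json


def ids_with_replies(text):
    """Return ids of objects whose totalReplyCount is greater than 0."""
    # Strip whitespace and the trailing comma, then wrap everything in
    # brackets so the comma-separated objects form one JSON array.
    wrapped = "[" + text.strip().rstrip(",") + "]"
    records = json.loads(wrapped)
    return [
        record["newLevel"]["id"]
        for record in records
        if record.get("totalReplyCount", 0) > 0
    ]


# Shortened stand-in for the file contents shown in the question.
sample = """
{ "totalReplyCount": 0,
  "newLevel": { "main": { "name": "Ronald Whitlock" },
                "id": "z12wcjdxfqvhif5ee22ys5ejzva2j5zxh04" } },
{ "totalReplyCount": 4,
  "newLevel": { "main": { "name": "other name" },
                "id": "kjsdbesd2wd2eedd23rf3r3r2e2dwe2edsd" } },
"""

print(ids_with_replies(sample))
```

With a real file, the text would come from f.read() inside a with open(...) as f: block. This reads the whole file into memory, which is fine for small files but may not suit very large ones.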

Community
  • I did try iterating over the data in json, but got errors because it is not in the proper json format. I tried to parse the data and write to a file, but the error I got was TypeError: Expected a character buffer object... – unseen_damage Jul 26 '16 at 14:12
  • @unseen_damage Are you sure your JSON is formatted correctly? Try using this to check first: https://jsonformatter.curiousconcept.com/ –  Jul 26 '16 at 17:02
  • The json is not formatted properly, although it is close to json. The file output is basically like such: `{ item: 0, { item 2: {item 3: xxx, item4: xxx} item5: xxx } }, { item: 0, { item 2: {item 3: xxx, item4: xxx} item5: xxx } }`, – unseen_damage Jul 27 '16 at 18:46

The Stack Overflow question "Parsing values from a JSON file using Python?" covers parsing values from a JSON file in Python.

Community
Gurpreet Singh

Here is a shell one-liner that should solve your problem, though it's not Python.

egrep -o '"(?:totalReplyCount|id)":(.*?)$' filename | awk '/totalReplyCount/ {if ($2+0 > 0) {getline; print}}' | cut -d: -f2

output:

"kjsdbesd2wd2eedd23rf3r3r2e2dwe2edsd"
duyue