0

I have downloaded a CSV file from Google Trends which presents data in this format:

Top cities for golden globes
City,golden globes
New York (United States),100
Los Angeles (United States),91
Toronto (Canada),69

Top regions for golden globes
Region,golden globes
United States,100
Canada,91
Ireland,72
Australia,72

There are 3-4 of these groups separated by whitespace. The first line of each group contains text I want to use as a key, followed by a list of dictionaries I need associated with that key. Does anyone have any advice on some Python tools I could use to make this happen? I'm not having much luck with Python's CSV library.

My desired output from the above CSV would look like this:

{
"Top cities for golden globes" :
   {
      "New York (United States)" : 100,
      "Los Angeles (United States)" : 91,
      "Toronto (Canada)" : 69
   },
"Top regions for golden globes" :
   {
      "United States" : 100,
      "Canada" : 91,
      "Ireland" : 72,
      "Australia" : 72
   }
}
  • 1
    You're desired output doesn't look like JSON to me ;-) -- e.g. `{"New York (United States)", 100}` – mgilson Jan 14 '15 at 00:23
  • possible duplicate of [Python CSV to JSON](http://stackoverflow.com/questions/19697846/python-csv-to-json) – João Paulo Jan 14 '15 at 00:27
  • apologizes for improper formatting, please see edit –  Jan 14 '15 at 00:32
  • @CollinSchupman, is this an one-off task? – Anzel Jan 14 '15 at 00:46
  • @Anzel, no, it's part of a dynamic process. every day or so, the system pulls this information for the top 10 trending words and puts into a new JSON format –  Jan 14 '15 at 00:56
  • @CollinSchupman, why couldn't you just pull the data from Google Trend in JSON format to start with? – Anzel Jan 14 '15 at 01:05
  • @Anzel, the available (unofficial, there appears to be no official API) API pulls down the data in CSV format, probably because that's what you can get directly from the website. Do you know an API that will give you JSON from it? –  Jan 14 '15 at 01:07
  • Google Trend data is embedded in the page source anyway that's why you can **view** the results. Simply use **requests** + html parser (**lxml**) for example, to get the page source, and you only need to parse the bit with ** – Anzel Jan 14 '15 at 01:18

2 Answers2

0

Your input format is so expectable that I would do it by hand, without a CSV library.

import json
from collections import defaultdict

fh = open("yourfile.csv")
result = defaultdict(dict) #dictionary holding the data
current_key = "" #current category
ignore_next = False #flag to skip header

for line in fh:
    line = line.strip() #throw away newline
    if line == "": #line is empty
        current_key = ""
        continue
    if current_key == "": #current_key is empty
        current_key = line #so the current line is the header for the following data
        ignore_next = True
        continue
    if ignore_next: #we're in a line that can be ignored
        ignore_next = False
        continue
    (a,b) = line.split(",")
    result[current_key][a] = b
fh.close()

#pretty-print data
print json.dumps(result, sort_keys=True, indent=4)
L3viathan
  • 26,748
  • 2
  • 58
  • 81
0

I'd try something like...:

row = []
dd = {}
with open('the.csv') as f:
    r = csv.reader(f)
    while True:
        if row:  # normal case, non-empty row
            d[row[0]] = row[1]
            row = next(r, None)
            if row is None: break
        else:  # row is empty at start and after blank line
            category = next(f, None)
            if category is None: break
            category = category.strip()
            next(r)  # skip headers row
            d = dd[category] = {}
            row = next(r, None)
            if row is None: break

Now, dd should be the dict-of-dicts you want, and you can json.dump it as you wish.

Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395