
I'm not good with regular expressions or programming.

I have my data like this in a text file:

RAMCHAR@HOTMAIL.COM (): 
PATTY.FITZGERALD327@GMAIL.COM ():
OHSCOACHK13@AOL.COM (19OB3IRCFHHYO): [{"num":1,"name":"Bessey VAS23 Vario Angle Strap Clamp","link":"http:\/\/www.amazon.com\/dp\/B0000224B3\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I1YMLERDXCK3UU&psc=1","old-price":"N\/A","new-price":"","date-added":"October 19, 2014","priority":"","rating":"N\/A","total-ratings":"","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/51VMDDHT20L._SL500_SL135_.jpg","page":1},{"num":2,"name":"Designers Edge L-5200 500-Watt Double Bulb Halogen 160 Degree Wide Angle Surround Portable Worklight, Red","link":"http:\/\/www.amazon.com\/dp\/B0006OG8MY\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I1BZH206RPRW8B","old-price":"N\/A","new-price":"","date-added":"October 8, 2014","priority":"","rating":"N\/A","total-ratings":"","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/5119Z4RDFYL._SL500_SL135_.jpg","page":1},{"num":3,"name":"50 Pack - 12"x12" (5) Bullseye Splatterburst Target - Instantly See Your Shots Burst Bright Florescent Yellow Upon Impact!","link":"http:\/\/www.amazon.com\/dp\/B00C88T12K\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I31RJXFVF14TBM","old-price":"N\/A","new-price":"","date-added":"October 8, 2014","priority":"","rating":"N\/A","total-ratings":"67","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/51QwsvI43IL._SL500_SL135_.jpg","page":1},{"num":4,"name":"DEWALT DW618PK 12-AMP 2-1\/4 HP Plunge and Fixed-Base Variable-Speed Router Kit","link":"http:\/\/www.amazon.com\/dp\/B00006JKXE\/ref=wl_it_dp_v_nS_ttl\/181-6441163-6563619?_encoding=UTF8&colid=37XI10RRD17X2&coliid=I39QDQSBY00R56&psc=1","old-price":"N\/A","new-price":"","date-added":"September 3, 2012","priority":"","rating":"N\/A","total-ratings":"","comment":"","picture":"http:\/\/ecx.images-amazon.com\/images\/I\/416a5nzkYTL._SL500_SL135_.jpg","page":1}]

Could anybody suggest an easy way of separating this data into two columns (the email ID in the first column and the JSON data in the second column)? Some rows might have just an email ID (like in row 1) and no corresponding JSON data.

Please help. Thanks!

    Hi, and welcome to StackOverflow. I have taken the liberty to reformat your post a little - I hope I didn't destroy the layout of your text file doing so. Could you check (and [edit](http://stackoverflow.com/posts/26987662/edit) the post if necessary)? Are the characters `1) `, `2) ` etc. really part of the file? – Tim Pietzcker Nov 18 '14 at 06:04
  • Thank you for editing it. It looks perfectly okay. No, 1) 2) etc are not part of file, I have just added them to make it easier to differentiate rows. – warwick12 Nov 18 '14 at 06:10
  • Then they need to be removed - otherwise you'll get solutions that expect these numbers to be there, which then won't work on your actual data. Never change the structure of sample data. – Tim Pietzcker Nov 18 '14 at 06:35
  • Thanks Tim. I really appreciate your help! – warwick12 Nov 18 '14 at 06:39

2 Answers


Please try the following solution (for Python 2). It assumes that each entry is on a single line, i.e. that there are no line breaks within the JSON substring. I've used in.txt as the name of your data file - change that to the actual filename/path:

import csv
import re
regex = re.compile("""
    ([^:]*)  # Match and capture any characters except colons
    :[ ]*    # Match a colon, followed by optional spaces
    (.*)     # Match and capture the rest of the line""", 
    re.VERBOSE)
with open("in.txt") as infile, open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    for line in infile:
        writer.writerow(regex.match(line).groups())
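
If you are on Python 3 instead, essentially only the way the output file is opened changes (the csv module wants a text-mode file opened with newline=""). A minimal sketch of the same approach, with a small extra guard for lines that contain no colon at all:

import csv
import re

# Same idea as above: capture everything before the first colon,
# then match the colon plus optional spaces, then capture the rest.
regex = re.compile(r"([^:]*):[ ]*(.*)")

with open("in.txt") as infile, open("out.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for line in infile:
        match = regex.match(line)
        if match:  # skip blank lines or lines without a colon
            writer.writerow(match.groups())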

If you are in a Linux/Unix environment, you can use sed like so (a.txt is your input file):

<a.txt sed 's/\(^[^ (]*\)[^:]*: */\1 /'

The regular expression ^[^ (]* matches the start of each line (^) followed by zero or more characters that are not a space or a left parenthesis ([^ (]*), and by wrapping it in \( and \) you make sed "remember" the matching string as \1. The [^:]*: * part then matches everything up to and including the colon, plus zero or more spaces after it. The whole matched expression is replaced in each line with the remembered \1 string, which is the email. The rest of the line is the JSON data, which is left intact.

If you want a CSV or a tab-separated file, replace the space character after \1 in the replacement with a comma or a tab, e.g. for CSV:

<a.txt sed 's/\(^[^ (]*\)[^:]*: */\1,/'
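
If you would rather stay in Python but like the simplicity of a single substitution, a rough equivalent of the sed command above might look like this (just a sketch, assuming the same a.txt input and writing the result to an out.csv file rather than to standard output):

import re

# Rough Python equivalent of the sed substitution (a sketch, not taken from
# the answer above): keep everything before the first space or "(" as the
# email, drop the rest up to and including the colon and any following
# spaces, and insert a comma as the separator.
pattern = re.compile(r"(^[^ (]*)[^:]*: *")

with open("a.txt") as infile, open("out.csv", "w") as outfile:
    for line in infile:
        outfile.write(pattern.sub(r"\1,", line, count=1))

Since the JSON values themselves contain commas, a tab separator may be the safer choice if the result will later be read by a strict CSV parser.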