Python - Parsing JSON formatted text file with regex

Question

I have a text file formatted like a JSON file, but everything is on a single line (it could be a MongoDB file). Could someone please point me in the direction of how I could extract values from it using a Python regex?

The text shows up like this:

{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","author":null,"d‌escription":null,"fi‌leAssetId":"034b9317‌-60d9-45c2-b6d6-0f24‌b59e1991","filename"‌:"Reports.pdf"},"cre‌atedBy":1531,"create‌dByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acro‌bat.png","id":3041,"‌inheritedPermissions‌":false,"name":"map"‌,"permissions":[23,8‌7,35,49,65],"type":3‌,"viewLevel":2},{"__‌type":"WikiNode:http‌:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","children":[],"c‌ontent":

I want to get the "fileAssetId" and "filename" values. I've tried to load it with Python's json module, but I get an error.

For the fileAssetId I tried this regex:

regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")

But i get the following 034b9317‌, 60d9, 45c2, b6d6, 0f24‌b59e1991

I'm not too sure how to get the data as it's displayed.

The text shows up like this: {"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content": I am wanting to get the "fileAssetId" and filename" — iHaag, Nov 23 '17 at 11:54
The dictionary is not complete. You are missing `[` at the beginning and `}]` at the end — ezdazuzena, Nov 23 '17 at 12:01
I would love to extract the value after "fileAssetId": and the value after "filename":, but I'm not too sure how to do it. — iHaag, Nov 23 '17 at 12:06
If you have an example I am more than happy to try it. Everything I've tried gives me KeyError: 'filename' — iHaag, Nov 23 '17 at 12:20
You should describe the error when you tried to use json.loads. It is probably more robust to fix that error than using a regex... — Serge Ballesta, Nov 23 '17 at 12:50
I tried the following: `import re import json from pprint import pprint json_data=open('jsonfile').read() data = json.loads(json_data) pprint(data)` Everything prints like `u'id': 204`, line by line, but if I try to access keys, that's where I run into trouble: `import re import json from pprint import pprint json_data=open('jsonfile').read() data = json.loads(json_data) data["filename"][0] pprint (data)` — iHaag, Nov 23 '17 at 13:00
If I try `with open('jsonfile', 'r') as f: distros_dict = json.load(f) for distro in distros_dict: print(distro['filename'])` the error is 'TypeError: string indices must be integers' — iHaag, Nov 23 '17 at 13:07
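The `TypeError: string indices must be integers` in the last comment is what you get when looping over a dict: the loop yields the key strings, not the nested objects. A minimal Python 3 sketch of walking the loaded data instead, assuming the file can be completed into valid JSON. The `sample` string is a hand-completed, shortened version of the fragment in the question, and `find_records` is a hypothetical helper name:

```python
import json

# Hand-completed, shortened version of the fragment from the question
# (assumption: the real file closes its brackets in a similar way).
sample = (
    '[{"d":{"author":null,'
    '"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"},"createdBy":1531,"id":3041}]'
)


def find_records(node):
    """Yield every dict, at any depth, that has both keys of interest."""
    if isinstance(node, dict):
        if "fileAssetId" in node and "filename" in node:
            yield node
        for value in node.values():
            yield from find_records(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_records(item)


data = json.loads(sample)
records = [(r["fileAssetId"], r["filename"]) for r in find_records(data)]
print(records)
```

Because the walk does not depend on where the objects sit in the structure, it keeps each fileAssetId paired with its own filename without needing to know the exact nesting.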

ezdazuzena · Answer 1 · 2017-11-23T13:29:08.123

1

You can use Python's walk method and check each entry with re.match.

If the string you have cannot be converted to a Python dict, you can use a regex on its own:

# your_pattern holds the text to search, not the regular expression
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_pattern).group(1))

Solution for your example:

import re

# raw string, so the \/ sequences are kept exactly as they appear in the file
example_string = r'{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'

regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))

Executing this yields:

fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991
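On argument order, which is easy to mix up: with the module-level functions the regular expression comes first and the text to scan second, while a compiled pattern takes only the text. A small sketch; the shortened `text` value and the variable names are illustrative assumptions, and `re.search` is used in place of `re.match` so that no leading `.*` is needed:

```python
import re

# Shortened stand-in for the text read from the file (assumption).
text = (
    '{"d":{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"}}'
)
pattern = r'"fileAssetId":"([^"]+)"'

# Module-level function: regular expression first, text second.
m1 = re.search(pattern, text)

# Compiled pattern: only the text is passed to the method.
compiled = re.compile(pattern)
m2 = compiled.search(text)

# Both calls return None when nothing matches, so guard before .group().
asset_id = m1.group(1) if m1 else None
print(asset_id)
```

Passing a compiled pattern where the text is expected raises a TypeError, and calling `.group()` on a failed match raises an AttributeError on `None`.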

edited Nov 23 '17 at 13:29

answered Nov 23 '17 at 11:48

ezdazuzena


coming up red in sublime text \":\"([^\"]+)\".*`).group(1) – iHaag Nov 23 '17 at 12:13
File "test2.py", line 18 fileAsset = re.match(r`.*fileAssetId\":\"([^\"]+)\".*`, regex).group(1) ^ SyntaxError: invalid syntax – iHaag Nov 23 '17 at 12:16
you are missing `'` – ezdazuzena Nov 23 '17 at 12:22
Thank you, that's better, but it errors out: `return _compile(pattern, flags).match(string) TypeError: expected string or buffer` – iHaag Nov 23 '17 at 12:26
As one complete regex call, I get: `fileAsset = re.match(r'.*fileAssetId\":\"([^\"]+)\".*', r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})").group(1)` `AttributeError: 'NoneType' object has no attribute 'group'` – iHaag Nov 23 '17 at 12:29
Read the docs a bit! You don't need `r` in front of your string. You got the AttributeError because there was no match. Try pages like http://regex101.com/ to fine-tune your regular expression – ezdazuzena Nov 23 '17 at 12:30
Removing the r gives SyntaxError: invalid syntax. I used regex101 and removed \S* as it looks like that wasn't required, but I can't remove the r. – iHaag Nov 23 '17 at 12:38
in my example, your_pattern is the string you received. the one you want to look at. This is the second argument in `re.match`. The first one is the regular expression – ezdazuzena Nov 23 '17 at 12:41
So what would I define `your_pattern` as, if not the pattern of the regular expression? – iHaag Nov 23 '17 at 13:01
`regex = re.compile(r"([0-9a-f]{8})-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{12})") fileAsset = re.match(r'.*fileAssetId\":\"([^\"]+)\".*', regex).group(1)` That is how I have it laid out. I used the online editor and it doesn't like your `.*fileAssetId\":\"([^\"]+)\".*` – iHaag Nov 23 '17 at 13:34
Your example works, even with the text file, but it's only showing one result. How could I loop it to search for all the "fileAssetId" values? – iHaag Nov 23 '17 at 13:54
Do a bit of research and don't expect SO to write your code ;) https://stackoverflow.com/questions/4697882/how-can-i-find-all-matches-to-a-regular-expression-in-python – ezdazuzena Nov 23 '17 at 14:22
Give it an upvote if you think it was helpful, and mark it as correct if you think it was the correct answer. Thanks ;) – ezdazuzena Nov 23 '17 at 15:20
what if fileAssetId has a number (integer or float) instead of string? – Khaled Ahmed Sobhy Nov 23 '19 at 07:35
@KhaledAhmedSobhy `[^\"]+` matches at least one character that is not `"`, so int and float will be matched as well. Though, you might want to cast it to an int or float once matched. – ezdazuzena Nov 25 '19 at 08:09
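One caveat on numeric values: the pattern in the answer expects a `"` right after the colon, so an unquoted number such as `"fileAssetId":42` would not match it as written. A minimal sketch of a variant that accepts either form; `VALUE_RE` and `extract_asset_id` are hypothetical names, and the number branch only covers plain integers and decimals:

```python
import re

# Either a quoted string (group 1) or a bare integer/decimal (group 2).
VALUE_RE = re.compile(r'"fileAssetId":\s*(?:"([^"]*)"|(-?\d+(?:\.\d+)?))')


def extract_asset_id(text):
    """Return the value as str, int or float, or None if the key is absent."""
    m = VALUE_RE.search(text)
    if m is None:
        return None
    quoted, number = m.groups()
    if quoted is not None:
        return quoted
    # Cast bare numbers, as suggested in the comment above.
    return float(number) if "." in number else int(number)


print(extract_asset_id('{"fileAssetId":"034b9317-60d9"}'))
print(extract_asset_id('{"fileAssetId":42,"filename":"a.pdf"}'))
```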

score 1 · Accepted Answer · answered Nov 23 '17 at 14:01

How about using positive lookahead and lookbehind:

(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")

captures the fileAssetId and

(?<=\"filename\":\").+?(?=\")

matches the filename.

For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)

To get a list of all matches use re.findall or re.finditer instead of re.match.

re.findall(pattern, string) returns a list of matching strings.

re.finditer(pattern, string) returns an iterator of match objects.
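A short sketch of both calls with the two lookaround patterns above. The two-record `text` is an assumption built from the values mentioned in this thread, and the variable names are illustrative:

```python
import re

# Two shortened records on one line (sample data assumed for illustration).
text = (
    '{"d":{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"}},'
    '{"d":{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b5a23491",'
    '"filename":"Reports2.pdf"}}'
)

asset_id_re = re.compile(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")')
filename_re = re.compile(r'(?<="filename":").+?(?=")')

# findall: a plain list of the matched strings.
asset_ids = asset_id_re.findall(text)
filenames = filename_re.findall(text)

# finditer: match objects, useful when the position is needed as well.
positions = [(m.start(), m.group()) for m in filename_re.finditer(text)]

print(asset_ids)
print(filenames)
```

Since neither pattern contains a capturing group, `findall` returns the whole match for each hit.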

answered Nov 23 '17 at 14:01

Igl3


That works, thank you so much, but it's only showing the first match, not all the values. I'm doing it this way: `import re f=open("jsonfile.txt") f=f.readlines() for line in f: m = re.search(r'(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")|(?<=\"filename\":\").+?(?=\")', line) print m.group()` – iHaag Nov 23 '17 at 14:12
As i said in my answer edit, use findall or finditer not search. – Igl3 Nov 23 '17 at 14:13
That works a treat, thank you. Is there a way i could store all the values for "filename" and "fileAssetId" so i could do something like wget = urllib.urlopen('http://samplewebsite.com' + fileAssetId_value + filename_value) ??? Thank you for your help. – iHaag Nov 23 '17 at 14:26
If one asset id is always associated with one filename, I would try to fix your json data and load it instead of using regex as it'll be a very complex regex to get the associated values. Can you do `with open('jsonfile', 'r') as f: distros_dict = json.load(f) for distro in distros_dict: print(distro)` and share the output? Then I can maybe tell you why you can't access the filename. – Igl3 Nov 23 '17 at 14:28
Output for your code loading it as a JSON file was just the letter d – iHaag Nov 23 '17 at 14:35
Assuming that the asset id is always before the filename in the String you can get tuples with this regex: `((?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")).+?((?<=\"filename\":\").+?(?=\"))` which would yield a list like this `[('034b9317-60d9-45c2-b6d6-0f24b59e1991', 'Reports.pdf'), ('034b9317-60d9-45c2-b6d6-0f24b5a23491', 'Reports2.pdf')]` – Igl3 Nov 23 '17 at 14:35
That works. I'll see how I go trying to get individual values and create an address from them, thank you so much for all your assistance, I really appreciate it. – iHaag Nov 23 '17 at 14:44
You're welcome. Please accept my answer if it helped you and solved the problem you asked about. – Igl3 Nov 23 '17 at 14:54
Will do. Any tips for how to group the values together, so the filename + fileAssetId? The other option I've thought about was exporting them to a CSV file and starting a new file to grab row1 and row2 and assign them values. I'll research it, I just thought I'd ask :) I am still learning Python. – iHaag Nov 23 '17 at 14:58
Just research a little bit about string formatting yourself and I bet you'll find a good way to format your url. – Igl3 Nov 23 '17 at 15:02
Thanks once again I appreciate it – iHaag Nov 23 '17 at 15:02
The complete regex and CSV export is `(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")|(?<=\"filename\":\").+?(?=\")|(?<=\"id\":)[a-fA-F0-9-]+?(?=\,)` with `with open('file.csv', 'wt') as f: writer = csv.writer(f) writer.writerow(pattern_name)` - Thank you Igle. Progressing slowly – iHaag Nov 23 '17 at 23:18
Okay, I'm stuck. I want to put the information in columns in an Excel file: fileAssetId, filename, id, with the found fileAssetId in one column, the filename in another column and the id in the next, in order. I've also discovered that id shows up on other objects, so I need id to be picked up only AFTER a filename has been found. Any ideas? – iHaag Nov 24 '17 at 14:59
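One way the column layout asked about above could be sketched is a single pattern with three capturing groups, so that `re.findall` returns one tuple per record. This assumes every record lists fileAssetId, then filename, then id, in that order; if a record is missing one of them, the lazy `.*?` will run on into the next record. The sample `text` (including the second id, 3042) is made up for illustration, and the CSV goes to an in-memory buffer rather than a real file:

```python
import csv
import io
import re

# Two shortened records; the second id (3042) is invented sample data.
text = (
    '{"d":{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"},"createdBy":1531,"id":3041,"type":3},'
    '{"d":{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b5a23491",'
    '"filename":"Reports2.pdf"},"createdBy":1531,"id":3042,"type":3}'
)

# The id is only captured after a fileAssetId and a filename were found.
record_re = re.compile(
    r'"fileAssetId":"([0-9a-fA-F-]+)"'
    r'.*?"filename":"([^"]*)"'
    r'.*?"id":(\d+)'
)

rows = record_re.findall(text)  # list of (fileAssetId, filename, id) tuples

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["fileAssetId", "filename", "id"])  # header row
writer.writerows(rows)                              # one row per record
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, the buffer can be swapped for `open('file.csv', 'w', newline='')`. If the data can be repaired and loaded with the json module, that is likely to be more robust than this pattern.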

score 0 · Answer 3 · answered Nov 23 '17 at 13:37

Try adding \n to the string that you are writing into the file (\n means new line).

answered Nov 23 '17 at 13:37

RandomCoder


score 0 · Answer 4 · answered Nov 24 '20 at 15:26

Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:

import regex  # third-party package from PyPI, not the standard-library re

json_pattern = (
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    r'(?&document)'
)

json_regex = regex.compile(json_pattern)

match = json_regex.match(json_document_text)

You can change the last line of json_pattern to match individual objects instead of a whole document by replacing (?&document) with (?&object). The regex turned out simpler than I expected, but I did not run extensive tests on it. It works fine for me on the hundreds of files I have tested. I will try to improve my answer if I find any issue when running it.
