5

I have a few files with generic extensions, such as "txt", or no extension at all. I'm trying to determine very quickly whether a file is JSON or CSV. I thought of using the magic module, but it doesn't work for what I'm trying to do. For example:

>>> import magic
>>> magic.from_file('my_json_file.txt')
'ASCII text, with very long lines, with no line terminators'

Is there a better way to determine whether a file is JSON or CSV? I can't load the entire file, and I need to make the determination very quickly. What would be a good solution here?

  • Even if there is a way to find out the _type_ of file based on its contents, you may not get accurate results if the JSON inside the file is invalid, if the delimiter is not consistent across the CSV data, or because of other such problems. Moreover, everything inside a txt file is treated as a `string`, regardless of whether it's JSON or not. – amanb Feb 14 '19 at 20:01
  • @amanb that's fine if it's not valid. I just want to determine, based on the first 1000 characters in the file, whether it's "probably JSON" or "probably CSV". Right now doing something like `s.startswith('{')` is giving me better results than `magic`, so there's got to be something that's a bit more accurate... –  Feb 14 '19 at 20:01
  • Hmm, you are unable to load the entire file, but magic.from_file is able to say that there are no line terminators. Apparently it can load the entire file. – RemcoGerlich Feb 14 '19 at 20:07
  • [Helpful semi-related post](https://stackoverflow.com/questions/6475328/how-can-i-read-large-text-files-in-python-line-by-line-without-loading-it-into) for future reference – jonroethke Feb 14 '19 at 20:09
  • @RemcoGerlich I've just copy-pasted some data into that file for testing purposes. The files could be very large (10GB) and I'm only downloading the first 1KB or so to see which file type it may be where it doesn't have an explicit extension. –  Feb 14 '19 at 20:10
  • `"hello"` is itself a valid standalone JSON document (for some versions of the standard). Which is to say that if you're looking at a degenerate (single-row, single-column) case, the same document can be *both* a valid CSV file and a valid JSON file. :) – Charles Duffy Feb 14 '19 at 20:16
  • Maybe it's as easy as checking whether there's a line break character in the first 1000 characters. Then it's a CSV. Will probably work for the vast majority of cases. – RemcoGerlich Feb 14 '19 at 20:40
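
A minimal sketch of the heuristics floated in these comments, assuming the first ~1000 characters have already been read into a string (the `sample` name is illustrative, not something from the question):

def quick_guess(sample):
    # JSON documents almost always open with an object or an array
    if sample.lstrip().startswith(('{', '[')):
        return 'probably JSON'
    # RemcoGerlich's suggestion: CSV rows of realistic length produce
    # a line break somewhere within the first 1000 characters
    if '\n' in sample or '\r' in sample:
        return 'probably CSV'
    return 'unknown'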

2 Answers

6

You can check if the file starts with either `{` or `[` to determine whether it's JSON, and you can load the first two lines with `csv.reader` and check whether the two rows have the same number of columns to determine whether it's CSV.

import csv

with open('file') as f:
    # a JSON document normally starts with an object or an array;
    # comparing against a tuple avoids matching an empty read
    # (note that '' in '{[' would be True)
    if f.read(1) in ('{', '['):
        print('likely JSON')
    else:
        # otherwise rewind and check whether the first two rows parse
        # as CSV with the same (non-trivial) number of columns
        f.seek(0)
        reader = csv.reader(f)
        try:
            if len(next(reader)) == len(next(reader)) > 1:
                print('likely CSV')
        except StopIteration:
            pass
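
If only the first kilobyte or so of the file has been downloaded, as the question's comments describe, the same checks can be run on that in-memory sample instead of an open file. Below is a sketch under that assumption; the `sample` argument and the choice to discard the last, probably truncated, row are illustrative rather than part of the original answer:

import csv
import io

def guess_format(sample):
    # sample: the first ~1 KB of the file, already decoded to str
    if sample.lstrip().startswith(('{', '[')):
        return 'likely JSON'
    try:
        # the last row of a truncated sample is usually cut off mid-line,
        # so drop it before comparing column counts
        rows = list(csv.reader(io.StringIO(sample)))[:-1]
        if len(rows) >= 2 and len(rows[0]) == len(rows[1]) > 1:
            return 'likely CSV'
    except csv.Error:
        pass
    return 'unknown'
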
blhsing
  • simple approach, I like this. Thanks for this solution. –  Feb 14 '19 at 20:07
  • one question on this -- why wouldn't you want to open the file in `rb` mode? What if, for example, it's not in utf-8 encoding (let's say it's utf-16)? –  Feb 14 '19 at 20:09
  • It's not required for all rows to have the same number of columns in a CSV. Indeed, leaving a completely blank row after the header and before the beginning of data is not unheard of. – Charles Duffy Feb 14 '19 at 20:14
  • beautiful solution! I may use it for my own purposes some day. – amanb Feb 14 '19 at 20:14
  • But if it's a 10GB JSON file on a single line, then this is problematic. – RemcoGerlich Feb 14 '19 at 20:39
0

You can use a try/except "technique", attempting to parse the data as a JSON object. Loading invalid JSON from a string raises a ValueError, which you can catch and handle however you want:

>>> import json
>>> s1 = '{"test": 123, "a": [{"b": 32}]}'
>>> json.loads(s1)

If the JSON is valid, no error is raised (the parsed object is simply returned); if not:

>>> import json
>>> s2 = '1;2;3;4'
>>> json.loads(s2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 369, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 2 - line 1 column 8 (char 1 - 7)

So you can build a function as follows:

import json

def check_format(filedata):
    try:
        json.loads(filedata)
        return 'JSON'
    except ValueError:
        return 'CSV'

>>> check_format('{"test": 123, "a": [{"b": 32}]}')
'JSON'
>>> check_format('1;2;3;4')
'CSV'
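
One caveat, given that the question only has access to the first ~1 KB of each file: `json.loads` also raises `ValueError` on a truncated prefix of otherwise valid JSON, so this function is only reliable when it receives the complete data. For example:

>>> check_format('{"test": 123, "a": [{"b": 32}]')  # closing brace missing
'CSV'
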
josepdecid