How to read a file in Python that contains newlines and tabs into a string?

Question

I am trying to read a file that contains tabs, newlines, and so on; the data is in JSON format.

When I read it using file.read(), readlines(), etc., all the newlines and tabs are read as well.

I have tried rstrip(), split(), etc., but in vain. Maybe I am missing something.

Here is essentially what I am doing:

 f = open('/path/to/file.txt')
 line = f.readlines()
 line.split('\n')

This is the data (including the raw tabs, hence the poor formatting):

        {
      "foo": [ {
       "id1" : "1",
   "blah": "blah blah",
       "id2" : "5885221122",
      "bar" : [
              {  
         "name" : "Joe JJ", 
          "info": [                 {
         "custid": "SSN",    
         "type" : "String",             }        ]
        }     ]     }     ]  }

I was wondering whether the extra whitespace can be ignored elegantly.

I am also hoping to use json.dumps().

If you simply have malformed data, I'm not sure there is any truly reliable way to "clean" it. Trivially, if you replace all spaces, you'll nuke the content, too. Some of the regex suggestions may help, but be aware that you're trying your best to make good data out of bad data, with poor accuracy. – Matt Feifarek, Jul 27 '11 at 16:07

score 6 · Answer 1 · answered Jul 26 '11 at 22:34

Why not just use json.load() if the data is json?

import json
with open('myfile.txt', 'r') as f:  # the file is closed when the block exits
    d = json.load(f)
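As a hedged illustration of why no manual stripping is needed for valid input: JSON treats spaces, tabs, and newlines between tokens as insignificant, so the parser skips them by itself. The sample string below is invented for this sketch and is not the asker's file.

```python
import json

# Hypothetical sample: valid JSON padded with tabs, newlines, and spaces.
raw = '\t{\n\t  "id1" : "1",\n\t\t"blah": "blah blah"\n  }\n'

data = json.loads(raw)      # whitespace between tokens is ignored
compact = json.dumps(data)  # re-serialized without the original layout

print(data["blah"])
print(compact)
```

This only helps when the content is otherwise well-formed; the parser still rejects structural errors.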

answered Jul 26 '11 at 22:34

Matt Feifarek

Oh, I suppose json is choking? – Matt Feifarek Jul 26 '11 at 22:36

liliumdev · Answer 2 · 2011-07-26T22:42:41.943

2

A little hack, inefficient I guess:

with open("/path/to/file.txt") as f:
    # Strips every newline, tab, and space, including those inside string values
    lines = f.read().replace("\n", "").replace("\t", "").replace(" ", "")

print(lines)

edited Jul 26 '11 at 22:42

answered Jul 26 '11 at 22:36

liliumdev

There could be spaces inside strings. – agf Jul 26 '11 at 23:10
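To make that objection concrete, here is a small sketch (the sample string is invented for illustration) comparing the blanket replace() approach with a parse-and-re-serialize round trip, which drops only the whitespace outside string values. It assumes the input is already valid JSON.

```python
import json

# Hypothetical sample with spaces inside the values.
raw = '{\n\t"name" : "Joe JJ",\n\t"blah": "blah blah"\n}'

# Blanket replacement also deletes the spaces inside the values.
stripped = raw.replace("\n", "").replace("\t", "").replace(" ", "")

# Round trip through the parser: layout whitespace goes, values stay intact.
compact = json.dumps(json.loads(raw), separators=(",", ":"))

print(stripped)
print(compact)
```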

score 0 · Answer 3 · answered Jul 26 '11 at 22:33

Where did that structure come from? My condolences. Anyway, as a start you might try this:

import re
cleanedData = re.sub('[\n\t]', '', f.read())  # f is an already-open file object

That's a brute-force removal of newline and tab characters. What it returns might be suitable for feeding into json.loads(). That depends entirely on whether the contents of the file are actually valid JSON once the extra whitespace and line breaks are cleared out.

answered Jul 26 '11 at 22:33

g.d.d.c

Might want to add `\r` to the list. – agf Jul 26 '11 at 23:09
@agf - he certainly could, but it sounds like it's not valid even if you get rid of the extra whitespace, not to mention this isn't really a great answer if there's a chance that his values contain tabs or newlines. It was just a stab at it. – g.d.d.c Jul 27 '11 at 02:45

score 0 · Answer 4 · answered Jul 26 '11 at 22:35

If you want to loop over each line, you can just:

for line in open('path/to/file.txt'):
  # Remove whitespace from both ends of line
  line = line.strip()

  # Do whatever you want with line

answered Jul 26 '11 at 22:35

Steven Hepting

score 0 · Answer 5 · answered Jul 26 '11 at 22:37

What about using the json module?

import json

# json.load() takes a file object; json.loads() takes a string
with open("/path/to/file.txt", "r") as f:
    tmp = json.load(f)

with open("/path/to/file2.txt", "w") as output:
    output.write(json.dumps(tmp, sort_keys=True, indent=4))

answered Jul 26 '11 at 22:37

Felipe

score 0 · Answer 6 · answered Jul 26 '11 at 23:24

$ cat foo.json | python -mjson.tool
Expecting property name: line 11 column 41

The trailing comma in "type" : "String", (right before the closing brace) is causing the JSON decoder to choke. If it weren't for that problem, you could use json.load() to load the file directly.
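The same diagnosis can be reproduced from Python. The sketch below is only an illustration with a made-up sample string; it assumes Python 3.5 or later, where json.JSONDecodeError exposes msg, lineno, and colno. The exact message and reported position vary between Python versions, so no expected output is shown.

```python
import json

# Hypothetical sample reproducing the problem: a comma before the closing brace.
bad = '{\n  "custid": "SSN",\n  "type" : "String",\n}'

try:
    json.loads(bad)
    error = None
except json.JSONDecodeError as exc:
    # msg, lineno, and colno describe where the parser gave up
    error = (exc.msg, exc.lineno, exc.colno)

print(error)
```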

In other words, you have malformed JSON, meaning you'll need to perform a replacement operation before feeding it to json.loads(). Since you'll need to read the file into a string completely to do the replacement operation anyway, use json.loads(jsonstr) instead of json.load(jsonfilep):

    >>> import json, re
    >>> jsonfilep = open('foo.json')
    >>> jsonstr = re.sub(r'''(["'0-9.]\s*),\s*}''', r'\1}', jsonfilep.read())
    >>> jsonobj = json.loads(jsonstr)
    >>> jsonstr = json.dumps(jsonobj)
    >>> print(jsonstr)
    {"foo": [{"blah": "blah blah", "id2": "5885221122", "bar": [{"info":
    [{"type": "String", "custid": "SSN"}], "name": "Joe JJ"}], "id1": "1"}]}

I only used the re module because a trailing comma could follow any value, whether a number or a string.
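As a variation on the same idea, a minimal sketch follows. It is not the pattern used above: it keys on the closing bracket instead of the preceding value, so it also covers a comma before ]. The helper name strip_trailing_commas and the sample string are hypothetical, and the regex is not string-aware, so treat it as a best-effort cleanup.

```python
import json
import re


def strip_trailing_commas(text):
    """Drop a comma that is followed only by whitespace and then } or ].

    Caveat: this is a plain regex, so it would also alter a string value
    that happens to contain ",}" or ",]".
    """
    return re.sub(r',(\s*[}\]])', r'\1', text)


# Hypothetical sample with a trailing comma before both } and ].
raw = '{"info": [ {\n\t"custid": "SSN",\n\t"type" : "String",\t}\t],  }'
obj = json.loads(strip_trailing_commas(raw))
print(obj)
```

If the data comes from a source you control, fixing the producer so that it emits valid JSON is the more reliable route.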
