I'm trying to read from a file that is currently being written to at high speed by a different Python script. There are ~70,000 lines in the file. When I try to read in the lines, I generally get to ~7,750 before my application exits.
I think this is because the file is being written to (append only). I have processed larger files (20k lines) before, but only while they were not being written to.
What steps can I take to troubleshoot further? How can I read from this file, despite it currently being written to?
I'm new-ish to Python. Any/all help is appreciated.
import json

tweets_data = []
tweets_file = open(tweets_data_path, "r")
i = 0
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
        i += 1
        if i % 250 == 0:
            print i
    except:
        continue

## Total # of tweets captured
print len(tweets_data)
- Python 2.7
- Ubuntu 14.04
Traceback (I get this for every read):

Traceback (most recent call last):
  File "data-parser.py", line 33, in <module>
    tweet = json.loads(line)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
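For what it's worth, this is the sort of tail-style read loop I was considering in case partially written lines are the problem. It's only a sketch and assumes the other script appends complete, newline-terminated JSON objects:

import json
import time

tweets_data = []
with open(tweets_data_path) as f:
    while True:
        line = f.readline()
        if not line:
            # No new data yet; give the writer a moment to catch up.
            # (A real version would need some stop condition.)
            time.sleep(0.5)
            continue
        if not line.endswith("\n"):
            # Possibly a partially written line; rewind and re-read it later.
            f.seek(-len(line), 1)
            time.sleep(0.5)
            continue
        try:
            tweets_data.append(json.loads(line))
        except ValueError:
            # Skip blank or malformed lines that don't decode.
            continue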
UPDATE:

I've modified my code to follow the suggestions put forth by @JanVlcinsky, and I've identified that the issue is not that the file is being written to. In the code below, if I comment out tweets_data.append(tweet), or if I add a condition so that tweets are only added to the array half as often, my program works as expected. However, if I try to read in all ~90,000 lines, my application exits prematurely.
import json

tweets_data = []
with open(tweets_data_path) as f:
    for i, line in enumerate(f):
        if i % 1000 == 0:
            print "line check: ", str(i)
        try:
            ## Skip "newline" entries
            if i % 2 == 1:
                continue
            ## Load tweets into array
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except Exception as e:
            print e
            continue

## Total # of tweets captured
print "decoded tweets: ", len(tweets_data)
print str(tweets_data[0]['text'])
Premature Exit Output:
When loading every valid line into the array...
...
line check: 41000
line check: 42000
line check: 43000
line check: 44000
line check: 45000
dannyb@twitter-data-mining:/var/www/cmd$
When loading every other valid line into the array...
...
line check: 86000
line check: 87000
line check: 88000
dannyb@twitter-data-mining:/var/www/cmd$
When loading every third valid line into the array...
...
line check: 98000
line check: 99000
line check: 100000
line check: 101000
decoded tweets: 16986
Ultimately, this leads me to believe the issue is related to the size of the array and my available resources (I'm on a VPS with 1 GB of RAM).
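One way I could confirm the memory theory is to print the process's peak RSS alongside the existing line counter. A minimal sketch (assuming Linux, where ru_maxrss is reported in kilobytes):

import resource

def peak_rss_mb():
    # Peak resident set size of this process; on Linux ru_maxrss is in KB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Inside the read loop, next to the existing counter:
if i % 1000 == 0:
    print "line check:", i, "| peak RSS (MB):", peak_rss_mb()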
FINAL: Doubling the RAM fixed the issue. It appears my Python script was exceeding the amount of RAM available to it. As a follow-up, I've started looking at ways to make the script more memory-efficient and at ways to increase the total amount of RAM available to it.
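As an example of the kind of trimming I have in mind, if the only field I need downstream is the tweet text, I could keep just that instead of the full decoded dict (a sketch, not my final code):

import json

tweet_texts = []
with open(tweets_data_path) as f:
    for line in f:
        line = line.strip()
        if not line:
            # Skip the blank "newline" entries between tweets.
            continue
        try:
            # Keep only the text field instead of the whole tweet dict.
            tweet_texts.append(json.loads(line)['text'])
        except (ValueError, KeyError):
            continue

print "decoded tweet texts:", len(tweet_texts)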