
I'm trying to read from a file that is currently being written to at high speed by a different Python script. There are ~70,000 lines in the file. When I try to read in the lines, I generally get to ~7,750 before my application exits.

I think this is due to the file being written to (append-only). I have processed larger files (20k lines) before, but only while they were not being written to.

What steps can I take to troubleshoot further? How can I read from this file, despite it currently being written to?

I'm new-ish to Python. Any/all help is appreciated.

import json

tweets_data = []
tweets_file = open(tweets_data_path, "r")
i = 0
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
        i += 1
        if i % 250 == 0:
            print i
    except:
        continue

## Total # of tweets captured
print len(tweets_data)

Environment:

  • Python 2.7
  • Ubuntu 14.04

Traceback: I get this for every line that fails to decode

    Traceback (most recent call last):
      File "data-parser.py", line 33, in <module>
        tweet = json.loads(line)
      File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
        raise ValueError("No JSON object could be decoded")
    ValueError: No JSON object could be decoded

UPDATE:

I've modified my code to follow the suggestions put forth by @JanVlcinsky. I've identified that the issue is not that the file is being written to. In the code below, if I comment out `tweets_data.append(tweet)`, or if I add a condition so that tweets are only added to the array half as often, my program works as expected. However, if I try to read in all ~90,000 lines, my application exits prematurely.

    import json

    tweets_data = []
    with open(tweets_data_path) as f:
        for i, line in enumerate(f):
            if i % 1000 == 0:
                print "line check: ", str(i)
            try:
                ## Skip "newline" entries
                if i % 2 == 1:
                    continue
                ## Load tweets into array
                tweet = json.loads(line)
                tweets_data.append(tweet)
            except Exception as e:
                print e
                continue

    ## Total # of tweets captured
    print "decoded tweets: ", len(tweets_data)
    print str(tweets_data[0]['text'])

Premature Exit Output:

When loading every valid line into the array...

...
line check:  41000
line check:  42000
line check:  43000
line check:  44000
line check:  45000
dannyb@twitter-data-mining:/var/www/cmd$

When loading every other valid line into the array...

...
line check:  86000
line check:  87000
line check:  88000
dannyb@twitter-data-mining:/var/www/cmd$

When loading every third valid line into the array...

...
line check:  98000
line check:  99000
line check:  100000
line check:  101000
decoded tweets:  16986

Ultimately, this leads me to believe the issue is related to the size of the array and my available resources (I'm on a VPS with 1 GB of RAM).
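
One way to confirm this would be to watch the process's memory usage as the lines are loaded. A minimal diagnostic sketch (assuming Linux, where the standard `resource` module reports peak RSS in kilobytes, and reusing `tweets_data_path` from the code above):

    import json
    import resource

    tweets_data = []
    with open(tweets_data_path) as f:
        for i, line in enumerate(f):
            if i % 1000 == 0:
                ## ru_maxrss = peak resident set size, in kilobytes on Linux
                peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                print "line:", i, "tweets:", len(tweets_data), "peak RSS (KB):", peak_kb
            ## Skip blank lines
            line = line.strip()
            if not line:
                continue
            try:
                tweets_data.append(json.loads(line))
            except ValueError as e:
                print e

If the reported peak climbs toward the ~800 MB that is free before the run, that would point at the list itself.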

FINAL: Doubling the RAM fixed this issue. It appears that my Python script was exceeding the amount of RAM made available to it. As a follow-up, I've started looking at ways to improve the script's memory efficiency, and at ways to increase the total amount of RAM available to it.
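
On the memory-efficiency side, one option (a sketch of an approach, not what the scripts currently do) is to avoid holding every full tweet dict in memory and instead stream the tweets through a generator, keeping only the fields the reports actually need, e.g. `text`:

    import json

    def iter_tweets(path):
        ## Yield decoded tweets one at a time instead of building one big list
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    yield json.loads(line)
                except ValueError:
                    continue

    ## Keep only the field the reports need (assuming 'text' is enough)
    texts = [tweet.get('text') for tweet in iter_tweets(tweets_data_path)]
    print "kept texts:", len(texts)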

Daniel Brown
  • Can you wait until your write is finished, then read? – user1157751 Jan 26 '16 at 21:42
  • The "write" is a flow of data coming from Twitter's Streaming API. It... Doesn't stop. I have a cron setup that splits each days data into separate files, but each file will be receiving writes for a straight 24-hours. – Daniel Brown Jan 26 '16 at 21:45
  • Did you try [http://stackoverflow.com/questions/3290292/read-from-a-log-file-as-its-being-written-using-python] ? – Milky Jan 26 '16 at 21:46
  • Just a thought, could you create a copy of the file and then read separately into that one while the other file is still being written to? would that work? – mmenschig Jan 26 '16 at 21:46
  • It's also possible that the read script is running faster than the write script and reaching the end of the file prematurely. It would be in my opinion much better to either wait until the file is done being written or to use a socket or some other method of streaming as opposed to writing to a file and then reading from it. – kylieCatt Jan 26 '16 at 21:46
  • @Milky -- I did see that SO post. It was interesting, but I'm not sure the same method will work here. I'm not trying to read the most recent entries _per se_, only all of the existing entries in the file at the time of script execution – Daniel Brown Jan 26 '16 at 21:49
  • @mmenschig, that is a good thought. These files are ~200MB depending on how heavy the traffic has been for the day. I'd like to avoid copying them if I can (and can you even copy a file that is being written to?). I'll likely try that approach if I can't find anything more elegant. – Daniel Brown Jan 26 '16 at 21:50
  • Is there any way to integrate the two scripts? In other words, can you just pass the tweets directly instead of trying to write and read at the same time? – Jared Goguen Jan 26 '16 at 21:53
  • @IanAuld, I believe the read script is running faster than the write script, but it is not reaching the end of the file. I'm printing line counts as I read, and the script is exiting ~63,000 lines early. – Daniel Brown Jan 26 '16 at 21:53
  • @o_o, That.. Is a fun thought. I'm not entirely sure how I would. The read script is generating hourly/daily/weekly/monthly comparisons of the data. But to that effect... I could look into storing data (at shorter intervals) to a DB, and instead read from there (reducing the chances of odd IO errors and race conditions) – Daniel Brown Jan 26 '16 at 21:55
  • @DanielBrown I did my own test adding ping output lines into a file and reading from another script. It works. I would suggest you remove the `except` block or print the exception you see to show us exactly what problem you have. Exact stacktrace would be the best. – Jan Vlcinsky Jan 26 '16 at 22:02
  • I'm pretty sure `open()` isn't streaming new lines either. It loads the file into memory when it's called and that's it. Any lines written to a file after `open()` is initially called are likely not being seen. – kylieCatt Jan 26 '16 at 22:11
  • Just do it in batches of 50k or whatever; when the file writer reaches 50k, the reader can read that file until the next batch is ready... – user2255757 Jan 26 '16 at 22:16
  • This really doesn't seem like a good use case for file writing/reading. This sounds like the exact situation that sockets were made to solve. – kylieCatt Jan 26 '16 at 22:19
  • @JanVlcinsky, I added a traceback of the error I'm getting. I receive it for each decode attempt, yet it doesn't appear to be what's ending the read of every line prematurely. If there is an exception specific to ending early, I can't seem to get it to display. – Daniel Brown Jan 26 '16 at 22:23
  • @IanAuld, after further testing, I believe you're correct. I'm starting to think that this is an issue with large files; not simultaneous read and writes. I don't need to read that data as it's coming in. I'd much rather trigger reads at specific times. Not entirely sure why sockets would be a better call? But I may be missing something. – Daniel Brown Jan 26 '16 at 22:23
  • Sockets can be treated as file-like objects so the concept is the same. You have your current file writer write to a socket instead of a file. The current reader would be listening to that socket and could take an action when there is something to read. This is essentially how any streaming data service operates on the web: something posts to an endpoint that a consumer is listening on, and once there is something to read it takes an action. From what you have described this is what you want to be doing. – kylieCatt Jan 26 '16 at 22:29
  • @IanAuld, hmm... Interesting. How long does that data persist? Or is it publish/subscribe based? There is one reader in my setup; it is not triggered by writes, but instead runs reports based on data aggregated hourly, daily, weekly, and monthly. – Daniel Brown Jan 26 '16 at 22:32
  • https://docs.python.org/2/howto/sockets.html – kylieCatt Jan 26 '16 at 22:33
  • I think that sockets are not the way to go. The existing appended file covers both storing and streaming, while a socket will probably serve only the streaming (unless you build a really complex solution behind it). – Jan Vlcinsky Jan 26 '16 at 22:39
  • Modify your twitter streaming script to output to stdout, then your reader script to read from stdin. Then `./twitter.py | tee output_file.txt | ./parser.py` – Colonel Thirty Two Jan 26 '16 at 22:47
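
A minimal sketch of the pipeline idea in the last comment (an assumption on my part: the streaming script prints one JSON object per line to stdout, and the parser reads from stdin instead of opening a file):

    import json
    import sys

    tweets_data = []
    ## Usage (hypothetical filenames): ./twitter.py | tee output_file.txt | python parser.py
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            tweets_data.append(json.loads(line))
        except ValueError:
            continue

    print "decoded tweets:", len(tweets_data)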

1 Answer


I think that your plan to read tweets from a continually appended file should work.

There will probably be some surprises in your code, as you will see.

Modify your code as follows:

import json
tweets_data = []
with open("tweets.txt") as f:
    for i, line in enumerate(f):
        if i % 250 == 0:
            print i
        line = line.strip()
        # skipping empty lines
        if not len(line):
            continue
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except MemoryError as e:
            print "Run out of memory, bye."
            raise e
        except Exception as e:
            print e
            continue

## Total # of tweets captured
print "decoded tweets", len(tweets_data)

The modifications:

  • with open...: this is simply a good habit, ensuring the file is closed regardless of what happens after opening it.
  • for i, line in enumerate(f): enumerate yields a growing index i for each line read from the iterated file f.
  • moving the every-250th-line print to the front of the loop. This may reveal that you really do read many lines, but that too many of them are not valid JSON objects. When the print was placed after json.loads, lines which failed decoding were never counted.
  • except Exception as e: it is a bad habit to catch every exception silently as you did before, because valuable information about the problem is hidden from your eyes. In a real run you will see that the printed exceptions help you understand the problem (a small counting variant is sketched right after this list).
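
Building on that last point, a possible diagnostic variant (a sketch, not required for the fix) that counts the good and bad lines separately, so you can see how many lines actually fail json.loads:

import json

good, bad = 0, 0
bad_samples = []
with open("tweets.txt") as f:
    for line in f:
        line = line.strip()
        # skipping empty lines
        if not line:
            continue
        try:
            json.loads(line)
            good += 1
        except ValueError:
            bad += 1
            bad_samples.append(line[:80])  # keep a short sample for inspection

print "good lines:", good
print "bad lines:", bad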

EDIT: added skipping of empty lines (not relying on empty lines appearing in a regular pattern).

Also added a direct catch for MemoryError to complain in case we run out of RAM.

EDIT2: rewritten to use a list comprehension (not sure if this optimizes the RAM used). It assumes all non-empty lines are valid JSON strings, and it does not print progress reports:

import json
with open("tweets.txt") as f:
    tweets_data = [json.loads(line)
                   for line in f
                   if len(line.strip())]

## Total # of tweets captured
print "decoded tweets", len(tweets_data)

It will probably run faster than the previous version as there are no append operations.
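
If only the total count is needed and memory is tight, the decoded tweets do not have to be kept at all. A sketch under that assumption (each line is decoded to validate it, then discarded):

import json

count = 0
with open("tweets.txt") as f:
    for line in f:
        if line.strip():
            json.loads(line)  # decode to validate, but do not keep the object
            count += 1

print "decoded tweets", count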

Jan Vlcinsky
  • Hi Jan, thanks for this reply. You've really helped me to better understand some of Python's nuances, and a cleaner way to implement my code. Unfortunately, while this did help me address some issues, my Python script is still ending prematurely. Of the 90,000+ lines, it now reads (and appends) 42,000 JSON objects to my array. If I ask it to skip every other line, it is able to get to line 84,000. This makes me believe that I'm experiencing a memory-related error, but no error is printed to the screen. Instead, the script simply stops. Any thoughts? :/ – Daniel Brown Jan 27 '16 at 05:05
  • @DanielBrown It is rather unlikely you are experiencing a memory-related problem. I would recommend making statistics of how many lines do not pass `json.loads` and how many do, possibly even storing the good and bad lines separately to see what is going on. – Jan Vlcinsky Jan 27 '16 at 08:16
  • I actually remedied that issue. Each entry was separated by an empty line. I now have the loop skip over the empty lines so only the lines containing a JSON object are parsed. This is all running on a VPS with 1 GB of RAM. Before running the script, I have roughly 800 MB of RAM available, and the CPU spikes up to 30% while processing. All JSON records are parsed and stored correctly. The application still exits before reading the entire file (only when adding to the array), and it exits before finishing with no error/warning. – Daniel Brown Jan 27 '16 at 13:04
  • @DanielBrown If you are really creating the file at a high pace, how do you know how many tweets you can actually read at any given moment? You have a moving target. Another problem could be buffering: the process writing into the file is probably using a buffer and flushing to disk only when the buffer fills up or when the file is closed. So the number of tweets the appender thinks were added is likely to be higher than what is really available in the file. – Jan Vlcinsky Jan 27 '16 at 13:51
  • Apologies. At this point, I've stopped trying to parse the file that is being written to. I am now looking at a full day's log of captured tweets. The file has ~90,000 lines; half of which are blank lines (the original cause of the `json.loads` error), and the other half are single-line JSON objects of a tweet's schema/data. At this point, I'm not attempting to parse/read a moving target. The file is closed, static, and won't be changing. However, my application still exits before reaching the end of the file, and does not output an error. I'll add the updated code + output shortly. – Daniel Brown Jan 27 '16 at 14:05
  • I've added additional info (now that I'm using your proposed solution) to my initial post. – Daniel Brown Jan 27 '16 at 14:26
  • Doubling the RAM available on my VPS fixes this issue. This leads me to believe that the issue is related to using more resources than I have allocated for my Python script. I'm accepting your answer because it got me the closest to realizing this. – Daniel Brown Jan 27 '16 at 18:21
  • @DanielBrown Thanks. I have modified my code to detect empty lines (based on content, not on the order of lines), plus raising an exception if we run out of RAM. I guess that even with the old code the last printed exception should say something about MemoryError. Or not? – Jan Vlcinsky Jan 27 '16 at 18:55
  • I'm not entirely sure why, but the last printed exception does not mention `MemoryError`. Even adding the `except MemoryError as e:` block doesn't trigger it. I'm at a loss as to why the exception isn't being raised, but regardless I'm working to optimize in-memory usage. (looking into python "generators") – Daniel Brown Jan 27 '16 at 19:04
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/101811/discussion-between-jan-vlcinsky-and-daniel-brown). – Jan Vlcinsky Jan 27 '16 at 19:38