
I am trying to fetch data from Last.fm and write it to a CSV file. I have a.csv, and for each line of a.csv I fetch additional data from Last.fm and save it to b.csv, so as a result a.csv and b.csv end up with the same number of lines.

a.csv is a large text file with about 8 million data lines, so I am trying to run multiple processes, each of which handles about 250,000 lines.

I tried the Python multiprocessing module, and I also tried simply running the script in multiple terminals. The problem is that most of the time (about 9 runs out of 10, or more), the processes randomly stop writing to their CSV files.
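
(Roughly, the multiprocessing variant looks like the sketch below; the worker function body, the number of workers, and the output file names are simplified placeholders rather than my exact code.)

import multiprocessing

def process_chunk(start, end, out_path):
    # placeholder: read lines start..end of a.csv, fetch from Last.fm,
    # and write the results to out_path (this is the loop shown further down)
    pass

if __name__ == '__main__':
    chunk = 250000
    workers = []
    for n in range(4):
        p = multiprocessing.Process(target=process_chunk,
                                    args=(n * chunk, (n + 1) * chunk, 'b_%d.csv' % n))
        p.start()
        workers.append(p)
    for p in workers:
        p.join()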

For example, I start 4 processes, and they all begin writing to their separate CSV files as expected. Then, after some random amount of time, some of the CSV files stop being modified. Sometimes one of the CSVs stops just a few minutes after I start the process, while others stop after a few hours, or even after tens of hours. These patterns are totally random, and only very rarely do all the processes finish successfully, which is why I cannot figure out why they keep stopping. I tried on other computers and there is no difference, so the problem does not seem to depend on the machine's resources.

Also, even after a CSV file stops being modified, its process is still running: I made the code print its progress to the terminal every 1000 data lines, and it keeps doing so.

Below is the overall structure of my code (I have only included, in abstracted form, the parts I thought were indispensable for understanding the program):

import csv
import urllib2
from lxml import etree  # or xml.etree.ElementTree

f_in = open('a.csv', 'rb')
f_out = open('b.csv', 'wb')
reader = csv.reader(f_in, delimiter=',')
# (same for other csv files needed ..)

for i, line in enumerate(reader, start=1):
    if i % 1000 == 0:
        print('1000 tracks processed')

    url = LASTFM_API_ROOT + data_from(line)  # placeholder: Last.fm API root url + selected data in this line
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)  # fetch data from Last.fm
    info = etree.fromstring(response.read())

    temp1 = info.find('data1').text.encode('utf-8')
    temp2 = info.find('data2').text.encode('utf-8')
    temp = [temp1, temp2]

    for column in temp:
        f_out.write('%s;' % column)
    f_out.write('\n')

f_in.close()
f_out.close()

Can anyone help?

user3052069
    So are you saying you have multiple processes writing to the same files? If you are not using locking, that *will* result in weird behaviour and lost data. – Martijn Pieters Aug 18 '16 at 06:27
  • No, they are each writing to separate files, as I mentioned: 'they will normally start writing to separate csv files'. – user3052069 Aug 18 '16 at 06:31
  • And how do you determine that the files are no longer being written to? Did you account for file buffering? Perhaps you can use the `logging` module to track how much info each process is receiving? – Martijn Pieters Aug 18 '16 at 06:39
  • I have almost no experience in systems programming, which is why I didn't try logging at the process or file-buffer level. I determined that the files are no longer being written to from the 'Modified' time shown in the file properties: while a process runs normally, the modified time keeps changing and the file size grows every 1~3 minutes. Since the file is normally updated every 1~3 minutes, I didn't think the data would suddenly be written in big chunks. – user3052069 Aug 18 '16 at 06:51
  • Logging should help you debug what is going on; it may be that the LastFM servers are throttling your requests, for example. – Martijn Pieters Aug 18 '16 at 06:54
  • The OS is surely writing the data to disk in chunks - it would be inefficient to do otherwise. – Paul Cornelius Aug 18 '16 at 06:56
  • @MartijnPieters I thought of logging in any way, but I can't think of how or where to log, because the stopping patterns are so random as I mentioned.. Can you help me with any ideas according to the code structure I wrote? And to add, I already got a permission from Last.FM to fetch with multiple processes. – user3052069 Aug 18 '16 at 06:58
  • Many things could go wrong there like: network errors when performing the request, xml parsing errors, encoding errors. But since you said the process does not terminate, my guess is that no exceptions are being raised but it blocks in urlopen, since the default socket timeout is `None` (meaning no socket timeout). Try `urllib2.urlopen(req, timeout=10)` and see if you get any errors, if you do, you can implement a retry mechanism. – Seba Aug 18 '16 at 06:58
  • @PaulCornelius Yes, it does seem so, but is it possible that the chunk size changes by that much? Even if it is possible, the csv file that stopped being written to doesn't grow no matter how many more hours pass.. :( – user3052069 Aug 18 '16 at 07:02
  • @Seba I wrote a few exception blocks where xml or url errors appeared, which is why I thought something else was getting stuck when writing to the csv.. I didn't set any socket timeout, but the exception blocks should catch it if something goes wrong with urllib2.urlopen. I'm so confused! – user3052069 Aug 18 '16 at 07:06
  • Oops, I missed that you mentioned the process keeps printing the progress, so it isn't blocking. This is unrelated but you should use the [CSV module](https://docs.python.org/2/library/csv.html#writer-objects) from the standard lib instead of writing to the file object directly. Since if the data happens to contain a ';' or new line it will mess up the csv structure. – Seba Aug 18 '16 at 07:08
  • I would suggest closing and re-opening the file every 1000 writes. But also see [this post](http://stackoverflow.com/questions/28326378/not-able-to-write-into-a-file-using-python-multiprocessing?rq=1). – Paul Cornelius Aug 18 '16 at 07:13
  • @Seba Oh, it's my bad. I used the CSV module, I just didn't include in the abstracted code.. I edited my post. – user3052069 Aug 18 '16 at 07:14
  • @PaulCornelius I read the link you referenced, but the situation in that question is a bit different from mine, since mine does start writing to a new file.. And could you further explain why I should try re-opening the file every 1000 writes please? – user3052069 Aug 18 '16 at 07:25
  • I'm not suggesting that you should leave it that way in the final version of the program. It's a troubleshooting tactic, in the interest of changing something to get more information. You're saying that the program keeps running but mysteriously stops writing to the files. That seems almost impossible. By closing and re-opening the files you force the file time stamp to change, which you can observe, and eliminate any issues about the OS caching the data. When it fails it would be interesting to see what the last version of the file looked like. – Paul Cornelius Aug 18 '16 at 07:42
  • @PaulCornelius You're right. I should try closing, reopening and modifying the file every 1000 writes, as I don't remember any stops occurring 'too' early, i.e. at less than 1000 lines.. and if that doesn't work, I should try re-opening every 100 lines or so. Thank you for all your advice. – user3052069 Aug 18 '16 at 08:15
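
To make Seba's timeout suggestion above concrete, here is a minimal sketch of urlopen with a socket timeout plus a simple retry loop (the timeout value and retry count are arbitrary, and fetch_with_retry is a hypothetical helper, not part of the original code):

import socket
import urllib2

def fetch_with_retry(url, retries=3, timeout=10):
    # give up (and re-raise) after `retries` failed attempts instead of hanging forever
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = urllib2.urlopen(urllib2.Request(url), timeout=timeout)
            return response.read()
        except (urllib2.URLError, socket.timeout) as e:
            last_error = e
            print('attempt %d for %s failed: %s' % (attempt, url, e))
    raise last_error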
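
Likewise, a small sketch of Seba's csv.writer point: the writer quotes fields that contain the delimiter or a newline, so the CSV structure stays intact (the file name and sample values are purely illustrative):

import csv

with open('b.csv', 'wb') as f_out:               # 'wb' mode for the Python 2 csv module
    writer = csv.writer(f_out, delimiter=';')
    # fields containing ';' or a newline are quoted automatically, so the row stays intact
    writer.writerow(['artist;with;semicolons', 'title\nwith a newline'])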
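
And a sketch of Paul Cornelius's troubleshooting idea of closing and re-opening the output file every 1000 writes, which forces the buffered data and the file's modification time out to disk (write_rows and the file handling here are hypothetical, for illustration only):

def write_rows(rows, out_path):
    f_out = open(out_path, 'ab')       # append, so re-opening does not truncate earlier rows
    for i, row in enumerate(rows, start=1):
        f_out.write('%s\n' % row)
        if i % 1000 == 0:
            print('%d tracks processed' % i)
            f_out.close()              # pushes the buffered data (and the file's mtime) out
            f_out = open(out_path, 'ab')
    f_out.close()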

1 Answer


Try adding an f.flush() call somewhere, for instance in your 1000-line checkpoint. Maybe the files are just being buffered and not actually written out to disk yet. For example, see How often does python flush to a file?
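
A minimal illustration of that idea, using the i counter and f_out handle from the loop in the question (os.fsync is optional, in case you also want the OS to commit the data to disk rather than just emptying Python's buffer):

import os

# inside the main loop, at the existing 1000-line checkpoint:
if i % 1000 == 0:
    print('1000 tracks processed')
    f_out.flush()               # empty Python's internal buffer into the OS
    os.fsync(f_out.fileno())    # optional: also ask the OS to commit the data to disk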

VBB
  • It seems like the flush could be related to my problem; I'll read up on flush and give it a try. But I still don't understand, because my program does write to the file: I can open the file and see that data lines are written, it just suddenly stops writing at random. – user3052069 Aug 18 '16 at 07:29
  • The OS will flush files from time to time, so you will see something in the files. It's just that the next flush _might_ take longer than you've been waiting. – VBB Aug 19 '16 at 08:50