
I'm extracting XML data from 465 webpages, parsing it, and storing it in .csv files using a Python DataFrame. After running for about 30 minutes the program has saved 200 .csv files and then kills itself; the command line just says "Killed". But when I run the program separately for the first 200 pages and then for the remaining 265 pages, it works fine. I have searched the internet thoroughly and found no proper answer for this issue. Could you please tell me what the reason could be?

import requests
from pandas.io.json import json_normalize  # pandas < 1.0; use pd.json_normalize on newer versions

for i in list:  # `list` is my list of page ids (it shadows the built-in name)
    addr = str(url + i + '?&$format=json')
    response = requests.get(addr, auth=(self.user_, self.pass_))
    # print(response.content)
    json_data = response.json()
    if 'd' in json_data:
        df = json_normalize(json_data['d']['results'])
        paginate = True
        while paginate:
            if '__next' in json_data['d']:
                # follow the server-side pagination link and append the next page of results
                addr_next = json_data['d']['__next']
                response = requests.get(addr_next, auth=(self.user_, self.pass_))
                json_data = response.json()
                df = df.append(json_normalize(json_data['d']['results']))
            else:
                paginate = False
                try:
                    if not df.empty:
                        storage = '/usr/share/airflow/documents/output/' + i + '_output.csv'
                        df.to_csv(storage, sep=',', encoding='utf-8-sig')
                except Exception:
                    pass  # errors while writing the CSV are silently ignored

Thanks in advance!

  • If the program is being killed you are most likely running out of memory. I would suggest parsing your files in smaller batches, say 20-40 files at a time, exporting them and then concatenating the exported files afterwards (a sketch of that idea follows these comments). – Alex Oct 29 '18 at 10:11
  • I'm not actually concatenating the files together. I'm parsing each webpage and storing it in a separate csv file, but that is 465 webpages and 465 .csv files. Thanks – Raj Oct 29 '18 at 10:23
  • In that case, maybe post your code so that we could suggest changes. More information is needed to help you. Take a look here: [reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Alex Oct 29 '18 at 11:34
  • @Alex The code has been uploaded; please let me know if you have any questions. Thanks – Raj Oct 29 '18 at 12:18
  • Do you see anything like "OOM Killer" in the system logs from around the time when this happened? – tripleee Oct 29 '18 at 12:47
  • Also here: https://stackoverflow.com/questions/19189522/what-does-killed-mean-when-a-processing-of-a-huge-csv-with-python-which-sudde – Jeyekomon Jun 08 '21 at 12:56
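
To make the batching idea from the first comment concrete: one low-tech option is to run the same script several times over slices of the page list, so each run only ever holds one slice's worth of data. This is only a sketch; extract.py, all_pages and process_page are placeholder names standing in for the question's script, page list and loop body:

import sys

# Hypothetical runner: `python extract.py 0 200`, then `python extract.py 200 465`
def main():
    start, end = int(sys.argv[1]), int(sys.argv[2])
    for i in all_pages[start:end]:  # all_pages: placeholder for the 465 page ids
        process_page(i)             # placeholder for the download/paginate/to_csv body

if __name__ == '__main__':
    main()

Each run starts in a fresh process, so whatever memory one slice needed is returned to the OS before the next slice starts.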

1 Answer


It looks like you are running out of memory.

You can either increase the memory available to the process (the quick fix), or optimize your code so it uses less memory (the better solution).

If speed is not critical, you could also save intermediate data to temporary files and read from them when needed, but I suspect that for loop can be optimised to use less memory without touching the file system. After all, memory is where the loop's working data should live.

Also try running your code without the try/except block, so that any real error is not silently swallowed.
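
One way to cut memory use, sticking with the code from the question, is to write each page of results to the CSV as soon as it arrives rather than accumulating everything in one DataFrame through repeated append calls. The sketch below is only an illustration of that idea; dump_page is a hypothetical helper, and it reuses the url, auth and output-directory values from the question:

import requests
from pandas.io.json import json_normalize  # pd.json_normalize on newer pandas

def dump_page(i, url, auth, out_dir='/usr/share/airflow/documents/output/'):
    # Stream one page's paginated results straight to its CSV file.
    storage = out_dir + i + '_output.csv'
    addr = url + i + '?&$format=json'
    first_chunk = True
    while addr:
        response = requests.get(addr, auth=auth)
        json_data = response.json()
        if 'd' not in json_data:
            break
        chunk = json_normalize(json_data['d']['results'])
        # Write each chunk immediately; only the first write creates the header.
        chunk.to_csv(storage,
                     mode='w' if first_chunk else 'a',
                     header=first_chunk,
                     encoding='utf-8-sig' if first_chunk else 'utf-8',
                     sep=',')
        first_chunk = False
        addr = json_data['d'].get('__next')  # None when there is no next page

Calling dump_page(i, url, (self.user_, self.pass_)) inside the existing loop keeps only one page of results in memory at a time instead of the whole run.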

Stevan Tosic
  • It's very likely that. `dmesg` should have an entry saying that the process got killed because it ran out of memory – hek2mgl Oct 29 '18 at 15:03
  • @Stevan, I tried running the code without the try/except block but it's still not working. – Raj Oct 29 '18 at 22:36
  • @Raj OK, did you try to increase the memory limit? How much memory do you have available on this machine? – Stevan Tosic Oct 29 '18 at 22:37
  • @StevanTosic I'm running my program on a Linux server. Is there a way to increase it? Thanks – Raj Oct 29 '18 at 22:42
  • This is the error I got now:

        Traceback (most recent call last):
          File "winning.py", line 103, in <module>
            main()
          File "winning.py", line 100, in main
            cred.Successfactors()
          File "winning.py", line 71, in Successfactors
            json_data = response.json()
          File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 896, in json
            return complexjson.loads(self.text, **kwargs)
          File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 860, in text
            content = str(self.content, encoding, errors='replace')
        MemoryError

    – Raj Oct 29 '18 at 22:45
  • @Raj Yes, as I can see from the error trace, the problem is that the server is running out of memory. If this is shared hosting you will need to check with the hosting provider how to increase the memory. On the other hand, I'm not sure how much programming experience you have, but your code is very "heavy" for the system; you are doing a lot of work in one loop. You could also try to run the loop in chunks, i.e. first 0-50, then 51-100, 101-150, and so on. – Stevan Tosic Oct 29 '18 at 23:05
  • @StevanTosic Hi, would having 30 GB of RAM (approx. 28 GB available) be enough to run the program, or will I get the same error? – Raj Oct 30 '18 at 05:18
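
One way to answer that last question empirically, rather than guessing, is to log the process's peak memory after each page; if the figure keeps climbing towards the machine's total as pages are processed, the loop is accumulating data and more RAM only postpones the kill. A small standard-library sketch (the label and print format are just illustrative):

import resource

def log_peak_memory(label):
    # On Linux, ru_maxrss reports the peak resident set size in kilobytes.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('%s: peak RSS %.1f MB' % (label, peak_kb / 1024.0))

# e.g. call log_peak_memory('after page ' + i) at the end of each loop iteration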