
I am trying to create a list with all the newspaper articles from 5 different sources. They are stored in JSON format, in separate files whose names contain the newspaper and the year (time span 2005-2015). The problem is that one of the newspapers is available only for 2014-15, so when I loop over everything together I get an error. This is my attempt:

import json
import nltk
import re
import pandas

appended_data = []

for i in range(2005,2016):
    df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)])
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)])
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)])
    appended_data.append(df0)
    appended_data.append(df1)
    appended_data.append(df2)
    appended_data.append(df3)
    appended_data.append(df4)


appended_data = pandas.concat(appended_data)

doc_set = appended_data.body

My questions are: does this code do what I am aiming for (creating a single list with the body of all articles from each newspaper over time)? And how can I write it so that I skip the years 2005-2013 for the first newspaper (SDM)?

Stefan
Economist_Ayahuasca
  • It's difficult to answer your first question without any data, but for your second question, you can test if the file exists with [os.path.exists](https://docs.python.org/3/library/os.path.html). – IanS May 25 '16 at 14:31
  • Note you should check that the read works and the other files exist just in case. – sabbahillel May 25 '16 at 14:39
  • You should close your files to avoid too many open at once. Even if this works, it is better to start with good habits for future programs. – sabbahillel May 25 '16 at 14:43

2 Answers


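For the skipping part, you can guard the SDM read with a check on the year:

for i in range(2005,2016):
    # SDM is only available for 2014 and 2015, so skip it in earlier years.
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    appended_data.append(df1)
    # Read and append the remaining newspapers as before.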

To know whether the code performs as expected, we'd need some sample data.
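As a fuller illustration of the same idea, here is a minimal sketch that records the first available year per newspaper instead of hard-coding the check inside the loop. The names `FIRST_YEAR`, `load_articles` and `data_dir` are hypothetical, and it assumes (as in the question) one JSON object per line and that every newspaper other than SDM covers the whole 2005-2015 span.

```python
import json
import os

import pandas

# First year available for each newspaper prefix. The SDM value comes from the
# question; the others are assumed to cover the whole 2005-2015 span.
FIRST_YEAR = {'SDM': 2014, 'Scot': 2005, 'APJ': 2005, 'TH500': 2005, 'DRSM': 2005}


def load_articles(data_dir='.', last_year=2015):
    """Return one DataFrame with the articles of every newspaper and year."""
    frames = []
    for prefix, first_year in FIRST_YEAR.items():
        for year in range(first_year, last_year + 1):
            path = os.path.join(data_dir, '%s_%d.json' % (prefix, year))
            # One JSON object per line, as in the question's code.
            # The with block closes each file as soon as it has been read.
            with open(path) as f:
                frames.append(pandas.DataFrame([json.loads(line) for line in f]))
    # ignore_index gives the combined frame a single running index.
    return pandas.concat(frames, ignore_index=True)
```

With this in place, `doc_set = load_articles().body` would give the article bodies, provided each file has a `body` field as the question implies.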

Stefan
  • You're welcome. The `os.path.exists()` suggestion in the comments is also one to keep in mind for the general case where you don't know beforehand which files may or may not exist. – Stefan May 25 '16 at 14:36

First of all, you need to check which version of Python you are using, because it determines which type of error is thrown when the file is not found, as explained in the question "Python's "open()" throws different errors for "file not found" - how to handle both exceptions?".

Secondly, just in case there is a problem with some of the files you should check all of the possibilities.

One way would be:

  1. Create a list of file name prefixes: ['SDM', 'Scot', 'APJ', 'TH500', 'DRSM']

  2. Loop over the years.

  3. Create the file names by looping over the list of prefixes.

  4. Open the file as myfile inside a try ... except pair in order to handle any problems. Alternatively, you can use os.path.exists() in an if to avoid the try ... except. However, you should have a try ... except pair anyway, just in case something else goes wrong.

  5. Read the data into df = pandas.DataFrame([json.loads(l) for l in myfile]). Note you should also wrap this in a try ... except pair.

  6. Close the file so that you do not have too many open.

  7. Now append df to the list you are creating.

This should handle the situation.
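The steps above could be sketched roughly as follows. This is only an illustration: `read_articles`, `collect_articles` and `data_dir` are hypothetical names, and the files are assumed to hold one JSON object per line, as in the question. Catching both `IOError` and `OSError` covers the "file not found" case on Python 2 and Python 3, and the `with` block closes each file automatically.

```python
import json
import os

import pandas

PREFIXES = ['SDM', 'Scot', 'APJ', 'TH500', 'DRSM']


def read_articles(path):
    """Return a DataFrame for one file, or None if it is missing or malformed."""
    try:
        # The with block closes the file even if parsing fails.
        with open(path) as myfile:
            return pandas.DataFrame([json.loads(line) for line in myfile])
    except (IOError, OSError):
        # File missing or unreadable (IOError on Python 2, OSError on Python 3).
        return None
    except ValueError:
        # A line that is not valid JSON.
        return None


def collect_articles(years=range(2005, 2016), prefixes=PREFIXES, data_dir='.'):
    """Concatenate every readable file into a single DataFrame."""
    frames = []
    for year in years:
        for prefix in prefixes:
            path = os.path.join(data_dir, '%s_%d.json' % (prefix, year))
            df = read_articles(path)
            if df is not None:
                frames.append(df)
    # pandas.concat raises on an empty list, so fall back to an empty frame.
    if not frames:
        return pandas.DataFrame()
    return pandas.concat(frames, ignore_index=True)
```

Silently skipping unreadable files is a design choice made here for brevity; in practice you may prefer to log each skipped path so that an unexpected gap does not go unnoticed.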

Community
sabbahillel