0

I'm trying to load 10 dependent directories, which contains a bunch of JSON files, the structure is shown below:

5 events which divided into 2 categories

for fpathe1,dirs1,fs1 in os.walk('../input/charliehebdo/rumours/'):
 for f in fs1:
    with open(os.path.join(fpathe1,f)) as dir_loc:
        data.append(json.loads(dir_loc.read()))
        charliehebdo = pd.DataFrame(data)
        charliehebdo['label'] = 'TRUE'
        charliehebdo['event'] = 'charliehebdo'
for fpathe2,dirs2,fs2 in os.walk('../input/charliehebdo/non-rumours/'):
     for f in fs2:
        with open(os.path.join(fpathe2,f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
            nonRumourcharliehebdo = pd.DataFrame(data)
            nonRumourcharliehebdo['label'] = 'FALSE'
            nonRumourcharliehebdo['event'] = 'charliehebdo'
for fpathe3,dirs3,fs3 in os.walk('../input/ferguson/rumours/'):
 for f in fs3:
    with open(os.path.join(fpathe3,f)) as dir_loc:
        data.append(json.loads(dir_loc.read()))
        ferguson = pd.DataFrame(data)
        ferguson['label'] = 'TRUE'
        ferguson['event'] = 'ferguson'
for fpathe4,dirs4,fs4 in os.walk('../input/ferguson/non-rumours/'):
     for f in fs3:
        with open(os.path.join(fpathe3,f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
            nonRumourferguson = pd.DataFrame(data)
            nonRumourferguson['label'] = 'FALSE'
            nonRumourferguson['event'] = 'ferguson'

However, the sample code is extremely time-consuming(I ran on my laptop with Intel Core i7-4720HQ and it cost me 24hr+) so I'm wondering if there's any better solution?

well, it seems that my structure figure confuse or mislead you so here is the dataset.raw dataset

I intended to illustrate the dataset by figure but it turns out to be worse.

Tilmant
  • 1
  • 3
  • What does the "structure" diagram convey exactly? What are the words intended to represent, and what are the lines for? – Mad Physicist Nov 12 '18 at 06:39
  • I suggest you first profile your script and see where it's spending most of the time—because that will tell you if using concurrent processing will worthwhile or not. See [How can you profile a script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-script) – martineau Nov 12 '18 at 06:41
  • Your code **looks** like it's probably I/O bound, so parallel processing may not speed things up—and could in fact slow things down because of the overhead involved in using it. – martineau Nov 12 '18 at 06:50
  • @MadPhysicist: From the code it looks to me like it's the filesystem directory structure involved. – martineau Nov 12 '18 at 06:52

0 Answers0