How to implement code to manipulate files that runs in parallel?

Question

I'm trying to load 10 dependent directories, each of which contains a bunch of JSON files. The structure is shown below:

import json
import os

import pandas as pd

data = []  # parsed JSON objects; assumed to start empty, shared by all four loops

# charliehebdo, rumours
for fpathe1, dirs1, fs1 in os.walk('../input/charliehebdo/rumours/'):
    for f in fs1:
        with open(os.path.join(fpathe1, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
            charliehebdo = pd.DataFrame(data)
            charliehebdo['label'] = 'TRUE'
            charliehebdo['event'] = 'charliehebdo'
# charliehebdo, non-rumours
for fpathe2, dirs2, fs2 in os.walk('../input/charliehebdo/non-rumours/'):
    for f in fs2:
        with open(os.path.join(fpathe2, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
            nonRumourcharliehebdo = pd.DataFrame(data)
            nonRumourcharliehebdo['label'] = 'FALSE'
            nonRumourcharliehebdo['event'] = 'charliehebdo'
# ferguson, rumours
for fpathe3, dirs3, fs3 in os.walk('../input/ferguson/rumours/'):
    for f in fs3:
        with open(os.path.join(fpathe3, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
            ferguson = pd.DataFrame(data)
            ferguson['label'] = 'TRUE'
            ferguson['event'] = 'ferguson'
# ferguson, non-rumours (iterate fs4/fpathe4 here, not the previous loop's fs3/fpathe3)
for fpathe4, dirs4, fs4 in os.walk('../input/ferguson/non-rumours/'):
    for f in fs4:
        with open(os.path.join(fpathe4, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
            nonRumourferguson = pd.DataFrame(data)
            nonRumourferguson['label'] = 'FALSE'
            nonRumourferguson['event'] = 'ferguson'

However, the sample code is extremely time-consuming (I ran it on my laptop with an Intel Core i7-4720HQ and it took more than 24 hours), so I'm wondering if there's a better solution?

Well, it seems that my structure figure confused or misled you, so here is the dataset: raw dataset

I intended to illustrate the dataset with a figure, but that turned out to be worse.
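
A possible direction for the loading code in the question, offered as a sketch rather than a tested fix for this dataset: the loops rebuild the DataFrame after every single file, so the work grows roughly quadratically with the number of files. The version below collects plain records first and only tags them with `label` and `event`; the four directories are read in separate threads. The names `load_records` and `load_all`, the `max_workers` value, and the assumptions that every file holds one UTF-8 JSON object are all hypothetical and not taken from the question.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def load_records(root, label, event):
    """Parse every file under root and tag each record with label and event."""
    records = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as fh:
                # Assumption: each file contains a single JSON object (a dict).
                record = json.load(fh)
            records.append({**record, "label": label, "event": event})
    return records


def load_all(jobs, max_workers=4):
    """Run load_records for each (root, label, event) job in a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_dir = pool.map(lambda job: load_records(*job), jobs)
        # Flatten the per-directory lists, keeping the order of jobs.
        return [record for records in per_dir for record in records]


if __name__ == "__main__":
    # Paths and labels copied from the question.
    jobs = [
        ("../input/charliehebdo/rumours/", "TRUE", "charliehebdo"),
        ("../input/charliehebdo/non-rumours/", "FALSE", "charliehebdo"),
        ("../input/ferguson/rumours/", "TRUE", "ferguson"),
        ("../input/ferguson/non-rumours/", "FALSE", "ferguson"),
    ]
    records = load_all(jobs)
    print(len(records))
```

The combined list can then be handed to `pd.DataFrame(records)` once at the end. Threads only help while the program is waiting on the disk; in CPython the JSON parsing itself still runs one thread at a time, so it is worth measuring before relying on the thread pool.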

What does the "structure" diagram convey exactly? What are the words intended to represent, and what are the lines for? — Mad Physicist, Nov 12 '18 at 06:39
I suggest you first profile your script and see where it's spending most of the time, because that will tell you whether using concurrent processing will be worthwhile or not. See [How can you profile a script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-script) — martineau, Nov 12 '18 at 06:41
Your code **looks** like it's probably I/O bound, so parallel processing may not speed things up—and could in fact slow things down because of the overhead involved in using it. — martineau, Nov 12 '18 at 06:50
@MadPhysicist: From the code it looks to me like it's the filesystem directory structure involved. — martineau, Nov 12 '18 at 06:52
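
To act on the profiling suggestion in the comments above, something along these lines could be a starting point. It is only a sketch using the standard-library `cProfile` and `pstats` modules; the helper names `load_json_tree` and `profile_call` are made up for illustration, and the loader assumes every file under the directory is valid UTF-8 JSON.

```python
import cProfile
import io
import json
import os
import pstats


def load_json_tree(root):
    """Parse every file under root with json.load and return the results."""
    parsed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), encoding="utf-8") as fh:
                parsed.append(json.load(fh))
    return parsed


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    # Path taken from the question; trying a small subset first is advisable.
    _, report = profile_call(load_json_tree, "../input/charliehebdo/rumours/")
    print(report)
```

If most of the cumulative time shows up under `open` and file reads, the job is probably I/O bound as suspected in the comments; if it shows up in JSON decoding or DataFrame construction instead, the bottleneck is more likely CPU work in Python.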
