
I converted JSON data from a single folder into a pandas DataFrame, but the files were not read in sequential order. Does anybody know how to sort the data?

This is the output of json_files:

['BuzzFeed_Real_5-Webpage.json',
 'BuzzFeed_Fake_9-Webpage.json',
 'BuzzFeed_Fake_6-Webpage.json',
 'BuzzFeed_Fake_5-Webpage.json',
 'BuzzFeed_Fake_8-Webpage.json',
 'BuzzFeed_Real_6-Webpage.json',
 'BuzzFeed_Real_7-Webpage.json',
 'BuzzFeed_Real_8-Webpage.json',
 'BuzzFeed_Real_9-Webpage.json',
 'BuzzFeed_Real_2-Webpage.json',
 'BuzzFeed_Real_4-Webpage.json',
 'BuzzFeed_Real_1-Webpage.json',
 'BuzzFeed_Real_10-Webpage.json',
 'BuzzFeed_Fake_4-Webpage.json',
 'BuzzFeed_Fake_10-Webpage.json',
 'BuzzFeed_Fake_1-Webpage.json',
 'BuzzFeed_Fake_2-Webpage.json',
 'BuzzFeed_Real_3-Webpage.json',
 'BuzzFeed_Fake_3-Webpage.json',
 'BuzzFeed_Fake_7-Webpage.json']

However, my labels are in sequential order, as follows:

    label
0   BuzzFeed_Real_1
1   BuzzFeed_Real_2
2   BuzzFeed_Real_3
3   BuzzFeed_Real_4
4   BuzzFeed_Real_5
5   BuzzFeed_Real_6
6   BuzzFeed_Real_7
7   BuzzFeed_Real_8
8   BuzzFeed_Real_9
9   BuzzFeed_Real_10
10  BuzzFeed_Fake_1
11  BuzzFeed_Fake_2
12  BuzzFeed_Fake_3
13  BuzzFeed_Fake_4
14  BuzzFeed_Fake_5
15  BuzzFeed_Fake_6
16  BuzzFeed_Fake_7
17  BuzzFeed_Fake_8
18  BuzzFeed_Fake_9
19  BuzzFeed_Fake_10

Does anybody know how to sort the data based on the label? Thank you

Here is my code:

import os, json
import pandas as pd
import numpy as np

path_to_json = 'data/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('json')]
print(json_files)

#Here I define my pandas dataframe with the columns I want to get from json
jsons_data = pd.DataFrame(columns=['text','title'])

#We need both json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json,js)) as json_file:
        json_text = json.load(json_file)


        #the same structure 
        text = json_text['text']
        title = json_text['title']

        #Here I push a list of data into pandas DataFrame at row given by 'index'
        jsons_data.loc[index] = [text,title]

#Now that we have the pertinent json data in our DataFrame
print(jsons_data)

And this is the output of jsons_data:

text    title
0   Story highlights Obams reaffirms US commitment...   Obama in NYC: 'We all have a role to play' in ...
1   Well THAT’S Weird. If the Birther movement is ...   The AP, In 2004, Said Your Boy Obama Was BORN ...
2   The man arrested Monday in connection with the...   Bombing Suspect Filed Anti-Muslim Discriminati...
3   The Haitians in the audience have some newswor...   'Reporters' FLEE When Clintons Get EXPOSED!
4   Chicago Environmentalist Scumbags\n\nLeftists ...   The Black Sphere with Kevin Jackson
5   Obama weighs in on the debate\n\nPresident Bar...   Obama weighs in on the debate
6   Story highlights Ted Cruz refused to endorse T...   Donald Trump's rise puts Ted Cruz in a bind
7   Last week I wrote an article titled “Donald Tr...   More Milestone Moments for Donald Trump! – Eag...
8   Story highlights Trump has 45%, Clinton 42% an...   Georgia poll: Donald Trump, Hillary Clinton in...
9   Story highlights "This, though, is certain: to...   Hillary Clinton on police shootings: 'too many...
10  McCain Criticized Trump for Arpaio’s Pardon… S...   NFL Superstar Unleashes 4 Word Bombshell on Re...
11  On Saturday, September 17 at 8:30 pm EST, an e...   Another Terrorist Attack in NYC…Why Are we STI...
12  Less than a day after protests over the police...   Donald Trump: Drugs a 'Very, Very Big Factor' ...
13  Dolly Kyle has written a scathing “tell all” b...   HILLARY ON DISABLED CHILDREN During Easter Egg...
14  Former President Bill Clinton and his Clinton ...   Charity: Clinton Foundation Distributed “Water...
15  I woke up this morning to find a variation of ...   Proof The Mainstream Media Is Manipulating The...
16  Thanks in part to the declassification of Defe...   Declassified Docs Show That Obama Admin Create...
17  Critical Counties is a CNN series exploring 11...   Critical counties: Wake County, NC, could put ...
18  The Democrats are using an intimidation tactic...   Why is it “RACIST” to Question Someone’s Birth...
19  Back when the news first broke about the pay-t...   Clinton Foundation Spent 5.7% on Charity; Rest...
  • since u have the index, i would assume the index is mapped to each json - you could use dataframe.sort_index(). I would also suggest, u use pathlib ... that way json_files can be written as : list(Path(path_to_json).rglob('*.json')) just a suggestion. – sammywemmy Mar 25 '20 at 05:25
  • where can I add dataframe.sort_index() – Rosy Indah Permatasari Mar 25 '20 at 05:29
  • i'd suggest u share a sample of ur dataframe. easier to work with and make a clearer suggestion[link](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) kindly post data and not pics. just a couple of lines of ur dataframe showing the index as well – sammywemmy Mar 25 '20 at 05:36
  • I have shared my dataframe above.@sammywemmy – Rosy Indah Permatasari Mar 25 '20 at 05:46
  • ok... do u have a column for the labels? let's try it this way ... use pandas json : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html to get in ur data. let's say u have ur json files. u could write a list comprehension : [pd.read_json(file).assign(source = filename) for filename in list_filenames]. this is where we could appreciate pathlib. but try it and let's see... – sammywemmy Mar 25 '20 at 05:51
  • yes i have a columns for labels. I shared it above as well.. – Rosy Indah Permatasari Mar 25 '20 at 05:55
  • no labels column here... anyways try my suggestion and see if it works. if it fails, we can work together to get the data in using pathlib. I am biased towards it, hence my deference to it – sammywemmy Mar 25 '20 at 05:56
  • thank you so much for your help sammy! i will try first – Rosy Indah Permatasari Mar 25 '20 at 05:59
  • what should i changed in list_filenames? it said name 'list_filenames' is not defined? – Rosy Indah Permatasari Mar 25 '20 at 06:17
  • that should be ur list that contains the json files – sammywemmy Mar 25 '20 at 06:18

2 Answers


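You can sort the filenames with a custom key: split each name on '_', sort the Fake/Real part in descending order (so Real comes first) and the number in ascending order. The reversor class inverts the comparison for the string part: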

L = ['BuzzFeed_Real_5-Webpage.json',
 'BuzzFeed_Fake_9-Webpage.json',
 'BuzzFeed_Fake_6-Webpage.json',
 'BuzzFeed_Fake_5-Webpage.json',
 'BuzzFeed_Fake_8-Webpage.json',
 'BuzzFeed_Real_6-Webpage.json',
 'BuzzFeed_Real_7-Webpage.json',
 'BuzzFeed_Real_8-Webpage.json',
 'BuzzFeed_Real_9-Webpage.json',
 'BuzzFeed_Real_2-Webpage.json',
 'BuzzFeed_Real_4-Webpage.json',
 'BuzzFeed_Real_1-Webpage.json',
 'BuzzFeed_Real_10-Webpage.json',
 'BuzzFeed_Fake_4-Webpage.json',
 'BuzzFeed_Fake_10-Webpage.json',
 'BuzzFeed_Fake_1-Webpage.json',
 'BuzzFeed_Fake_2-Webpage.json',
 'BuzzFeed_Real_3-Webpage.json',
 'BuzzFeed_Fake_3-Webpage.json',
 'BuzzFeed_Fake_7-Webpage.json']

class reversor:
    def __init__(self, obj):
        self.obj = obj

    def __eq__(self, other):
        return other.obj == self.obj

    def __lt__(self, other):
        return other.obj < self.obj

a = sorted(L, key=lambda x: (reversor(x.split('_')[1]), int(x.split('_')[2].split('-')[0])))
print (a)
['BuzzFeed_Real_1-Webpage.json', 'BuzzFeed_Real_2-Webpage.json',
 'BuzzFeed_Real_3-Webpage.json', 'BuzzFeed_Real_4-Webpage.json', 
 'BuzzFeed_Real_5-Webpage.json', 'BuzzFeed_Real_6-Webpage.json', 
 'BuzzFeed_Real_7-Webpage.json', 'BuzzFeed_Real_8-Webpage.json', 
 'BuzzFeed_Real_9-Webpage.json', 'BuzzFeed_Real_10-Webpage.json', 
 'BuzzFeed_Fake_1-Webpage.json', 'BuzzFeed_Fake_2-Webpage.json', 
 'BuzzFeed_Fake_3-Webpage.json', 'BuzzFeed_Fake_4-Webpage.json', 
 'BuzzFeed_Fake_5-Webpage.json', 'BuzzFeed_Fake_6-Webpage.json', 
 'BuzzFeed_Fake_7-Webpage.json', 'BuzzFeed_Fake_8-Webpage.json', 
 'BuzzFeed_Fake_9-Webpage.json', 'BuzzFeed_Fake_10-Webpage.json']
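If the only goal is "Real before Fake, then by number", the reversor class is not strictly needed. The following is a minimal sketch, assuming every filename follows the BuzzFeed_<Real|Fake>_<number>-Webpage.json pattern shown in the question; the helper name label_order is made up for illustration:

```python
import re

def label_order(filename):
    """Sort key: 'Real' before 'Fake', then ascending by number."""
    match = re.search(r"_(Real|Fake)_(\d+)-", filename)
    if match is None:
        # Assumption: names that do not fit the pattern are an error.
        raise ValueError(f"unexpected filename: {filename}")
    kind, number = match.groups()
    # False sorts before True, so 'Real' entries come first.
    return (kind != "Real", int(number))

files = ['BuzzFeed_Real_10-Webpage.json',
         'BuzzFeed_Fake_2-Webpage.json',
         'BuzzFeed_Real_2-Webpage.json',
         'BuzzFeed_Fake_10-Webpage.json',
         'BuzzFeed_Real_1-Webpage.json']

ordered = sorted(files, key=label_order)
print(ordered)
```

Converting the number with int() is what puts 10 after 9 rather than after 1, which a plain string sort would not do.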

A similar idea in pandas: split the values into new columns, then sort with DataFrame.sort_values:

df = pd.DataFrame({'a':L})
df = df.join(df['a'].str.split('_', expand=True))
df['num'] = df[2].str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values([1, 'num'], ascending=[False, True])
print (df)
                                a         0     1                2  num
11   BuzzFeed_Real_1-Webpage.json  BuzzFeed  Real   1-Webpage.json    1
9    BuzzFeed_Real_2-Webpage.json  BuzzFeed  Real   2-Webpage.json    2
17   BuzzFeed_Real_3-Webpage.json  BuzzFeed  Real   3-Webpage.json    3
10   BuzzFeed_Real_4-Webpage.json  BuzzFeed  Real   4-Webpage.json    4
0    BuzzFeed_Real_5-Webpage.json  BuzzFeed  Real   5-Webpage.json    5
5    BuzzFeed_Real_6-Webpage.json  BuzzFeed  Real   6-Webpage.json    6
6    BuzzFeed_Real_7-Webpage.json  BuzzFeed  Real   7-Webpage.json    7
7    BuzzFeed_Real_8-Webpage.json  BuzzFeed  Real   8-Webpage.json    8
8    BuzzFeed_Real_9-Webpage.json  BuzzFeed  Real   9-Webpage.json    9
12  BuzzFeed_Real_10-Webpage.json  BuzzFeed  Real  10-Webpage.json   10
15   BuzzFeed_Fake_1-Webpage.json  BuzzFeed  Fake   1-Webpage.json    1
16   BuzzFeed_Fake_2-Webpage.json  BuzzFeed  Fake   2-Webpage.json    2
18   BuzzFeed_Fake_3-Webpage.json  BuzzFeed  Fake   3-Webpage.json    3
13   BuzzFeed_Fake_4-Webpage.json  BuzzFeed  Fake   4-Webpage.json    4
3    BuzzFeed_Fake_5-Webpage.json  BuzzFeed  Fake   5-Webpage.json    5
2    BuzzFeed_Fake_6-Webpage.json  BuzzFeed  Fake   6-Webpage.json    6
19   BuzzFeed_Fake_7-Webpage.json  BuzzFeed  Fake   7-Webpage.json    7
4    BuzzFeed_Fake_8-Webpage.json  BuzzFeed  Fake   8-Webpage.json    8
1    BuzzFeed_Fake_9-Webpage.json  BuzzFeed  Fake   9-Webpage.json    9
14  BuzzFeed_Fake_10-Webpage.json  BuzzFeed  Fake  10-Webpage.json   10

a = df['a'].tolist()
print (a)
['BuzzFeed_Real_1-Webpage.json', 'BuzzFeed_Real_2-Webpage.json',
 'BuzzFeed_Real_3-Webpage.json', 'BuzzFeed_Real_4-Webpage.json', 
 'BuzzFeed_Real_5-Webpage.json', 'BuzzFeed_Real_6-Webpage.json', 
 'BuzzFeed_Real_7-Webpage.json', 'BuzzFeed_Real_8-Webpage.json', 
 'BuzzFeed_Real_9-Webpage.json', 'BuzzFeed_Real_10-Webpage.json', 
 'BuzzFeed_Fake_1-Webpage.json', 'BuzzFeed_Fake_2-Webpage.json', 
 'BuzzFeed_Fake_3-Webpage.json', 'BuzzFeed_Fake_4-Webpage.json', 
 'BuzzFeed_Fake_5-Webpage.json', 'BuzzFeed_Fake_6-Webpage.json', 
 'BuzzFeed_Fake_7-Webpage.json', 'BuzzFeed_Fake_8-Webpage.json', 
 'BuzzFeed_Fake_9-Webpage.json', 'BuzzFeed_Fake_10-Webpage.json']
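To tie this back to the loading loop in the question, the list can be sorted before the files are read, so the DataFrame rows come out in label order from the start. This is only a sketch: the temporary folder and the placeholder text/title values stand in for the real data/ folder, and the helper name file_order is hypothetical.

```python
import json
import os
import tempfile

import pandas as pd

def file_order(name):
    # 'BuzzFeed_Real_5-Webpage.json' -> ['BuzzFeed', 'Real', '5-Webpage.json']
    _, kind, rest = name.split("_")
    return (kind != "Real", int(rest.split("-")[0]))

with tempfile.TemporaryDirectory() as path_to_json:
    # Placeholder files standing in for the real data/ folder.
    for name in ["BuzzFeed_Fake_1-Webpage.json",
                 "BuzzFeed_Real_10-Webpage.json",
                 "BuzzFeed_Real_2-Webpage.json"]:
        with open(os.path.join(path_to_json, name), "w") as f:
            json.dump({"text": f"text of {name}", "title": f"title of {name}"}, f)

    # os.listdir returns names in arbitrary order, so sort explicitly.
    json_files = sorted(
        (f for f in os.listdir(path_to_json) if f.endswith(".json")),
        key=file_order,
    )

    rows = []
    for js in json_files:
        with open(os.path.join(path_to_json, js)) as json_file:
            json_text = json.load(json_file)
        rows.append({"text": json_text["text"], "title": json_text["title"]})

# Building the frame once from a list of dicts avoids growing it row by row.
jsons_data = pd.DataFrame(rows, columns=["text", "title"])
print(jsons_data)
```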
jezrael

This should give you what you need to create an index from the filenames. Let me know if you need help setting the index, and whether you want a two-level index or a single combined one:

import os, json
import pandas as pd
import numpy as np

path_to_json = 'data/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('json')]
print(json_files)

#Here I define my pandas dataframe with the columns I want to get from json
jsons_data = pd.DataFrame(columns=['text','title'])

#We need both json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json,js)) as json_file:
        json_text = json.load(json_file)


        #the same structure 
        text = json_text['text']
        title = json_text['title']

        #Here I push a list of data into pandas DataFrame at row given by 'index'
        jsons_data.loc[index] = [text,title]

# Add column to your data frame containing 'json_files' list values
jsons_data['json_files'] = json_files

import re

# Create Regex to identify 'Fake' or 'Real' BuzzFeed
news_type = r"(Fake|Real)"

# Create Regex to extract numeric count
news_type_count = r"(\d+)"

# Extract news type to column
jsons_data['news_type'] = jsons_data['json_files'].str.extract(pat=news_type)

# Extract numeric count to column
jsons_data['news_type_count'] = jsons_data['json_files'].str.extract(pat=news_type_count)

# Convert numeric count to integer
jsons_data['news_type_count'] = jsons_data['news_type_count'].astype(int)

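# Sort dataframe by 'news_type' (descending, so 'Real' comes before 'Fake',
# matching the label order) and 'news_type_count' (ascending)
jsons_data = jsons_data.sort_values(by=['news_type', 'news_type_count'], ascending=[False, True])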

# Print head of dataframe
print(jsons_data.head())
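
One detail worth checking after sorting: the labels in the question list Real before Fake and use a 0..19 index, so the sorted frame needs descending order on the type column and a fresh index before it lines up row by row. A small sketch with placeholder titles (the four rows below are illustrative, not the real data; the label column is rebuilt from the filename under the assumption that the suffix is always -Webpage.json):

```python
import pandas as pd

# Placeholder rows; only the filenames are taken from the question.
jsons_data = pd.DataFrame({
    "json_files": ["BuzzFeed_Real_5-Webpage.json",
                   "BuzzFeed_Fake_9-Webpage.json",
                   "BuzzFeed_Real_10-Webpage.json",
                   "BuzzFeed_Fake_10-Webpage.json"],
    "title": ["t0", "t1", "t2", "t3"],
})

# Two capture groups -> a DataFrame with columns 0 (type) and 1 (number).
parts = jsons_data["json_files"].str.extract(r"_(Real|Fake)_(\d+)-")
jsons_data["news_type"] = parts[0]
jsons_data["news_type_count"] = parts[1].astype(int)

# 'Real' > 'Fake' alphabetically, so sort the type descending.
jsons_data = (
    jsons_data
    .sort_values(["news_type", "news_type_count"], ascending=[False, True])
    .reset_index(drop=True)  # renumber rows 0..n-1 to match the label frame
)

# Rebuild the label text from the filename for a quick sanity check.
jsons_data["label"] = jsons_data["json_files"].str.replace(
    "-Webpage.json", "", regex=False
)
print(jsons_data[["label", "title"]])
```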