How to skip duplicates and blank values from JSON dataframe and store into an array?

Question

The following code displays data from a JSON Line file.

import pandas as pd
import numpy


start = time.time()
with open('stela_zerrl_t01_201222_084053_test_edited.json', 'r') as fin:
    df = pd.read_json(fin, lines=True)
    parsed_data = df[["SRC/Word1"]].drop_duplicates().replace('', np.NAN).dropna().values.tolist()
    
print(parsed_data)

The output is:

[[' '], ['E1F25701'], ['E15511D7']]

Is there a way remove the blank data, duplicates, and store it as an array?

[drop_duplicates](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) to remove duplicates, [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing) to filter out blank rows and [to_numpy](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html) to convert to a numpy array — It_is_Chris, Jul 21 '21 at 18:21

imxitiz · Answer 1 · 2021-07-21T19:15:14.983

1

You can use .dropna(),.drop_duplicates().

parsed_data=parsed_data.drop.duplicates()
parsed_data.dropna(how='all', inplace = True) 
# do operation inplace your dataframe and return None.

parsed_data= parsed_data.dropna(how='all') 
# don't do operation inplace and return a dataframe as a result. 
# Hence this result must be assigned to your dataframe

If directly .dropna() not working for you then you may use .replace('', numpy.NaN,inplace=True).

Or you can try this too:

json_df=json_df[json_df['SRC/Word1'].str.strip().astype(bool)]

This is faster then doing .replace(), if you have empty string.

And now we cleaned it, we can just use .values.tolist() to get those value in list.

edited Jul 21 '21 at 19:15

answered Jul 21 '21 at 18:22

imxitiz

3,920
3
9
33

dropna doesnt work says it doesn't exist. The replace method have the output blank. – minTwin Jul 21 '21 at 18:38
1

@minTwin I had already told you, `inplace=True` doesn't return any values! – imxitiz Jul 21 '21 at 18:40
I get the following output: [[' '] ['E1F25701'] ['E15511D7']] – minTwin Jul 21 '21 at 18:46
parsed_data = json_df[["SRC/Word1"]].drop_duplicates().dropna(how = 'all').to_numpy() Produces the result. The data all has to be in a single array. – minTwin Jul 21 '21 at 18:47
There is an error: 'DataFrame' object has no attribute 'to_list' if I change to_numpy to to_list() – minTwin Jul 21 '21 at 18:51
It is outputing multiple arrays. Is there a way to make all the data be stored in a single array? – minTwin Jul 21 '21 at 18:53
What does that mean? Example please! – imxitiz Jul 21 '21 at 18:54
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/235159/discussion-between-mintwin-and-xitiz). – minTwin Jul 21 '21 at 18:55
1

@minTwin do this `json_df=json_df[json_df['SRC/Word1'].str.strip().astype(bool)]` – imxitiz Jul 21 '21 at 19:04
@minTwin According to [this](https://stackoverflow.com/a/56708633) doing `.astype()` is faster then any other. – imxitiz Jul 21 '21 at 19:09
@minTwin you are telling that you get many array, does this solve your that problem? `prased_data=[[' '], ['E1F25701'], ['E15511D7']] print([a[0] for a in prased_data]) ` – imxitiz Jul 21 '21 at 19:26

bpgeck · Answer 2 · 2021-07-21T19:41:09.507

1

Yup! Pandas has built-in functions for all of these operations:

import pandas as pd
df = pd.read_json('stela_zerrl_t01_201222_084053_test_edited.json', lines=True)
series = df['SRC/Word1']
no_dupes = series.drop_duplicates()
no_blanks = no_dupes.dropna()
final_list = no_blanks.tolist()

If you want an numpy array rather than a python list, you can change the last line to the following:

final_array = no_blanks.to_numpy()

edited Jul 21 '21 at 19:41

answered Jul 21 '21 at 18:23

bpgeck

1,592
1
14
32

The following gets an error: ValueError: Trailing data – minTwin Jul 21 '21 at 18:34
1

@minTwin Ah so this occurs when your JSON file has more than one JSON block in it. A separate issue from this post: https://stackoverflow.com/questions/30088006/loading-a-file-with-more-than-one-line-of-json-into-pandas – bpgeck Jul 21 '21 at 18:42
@minTwin Fixed my answer to use `lines=True` to prevent that error – bpgeck Jul 21 '21 at 19:41

ThePyGuy · Answer 3 · 2021-07-21T19:08:31.090

1

Drop the duplicates, replace empty string by NaN, then create the list.

>>> df.drop_duplicates().replace('', np.NAN).dropna().values.tolist()
[['E1F25701'], ['E15511D7']]

PS: Since what you have is a dataframe so it will be a 2D List, if you want 1D list, you may do it for the specific column:

>>> df['SRC/Word1'].drop_duplicates().replace('', np.NAN).dropna().tolist()
['E1F25701', 'E15511D7']

What you have is not an empty string, but white space character, Try this: replace \s+ to np.NAN with regex=True:

>>>df['SRC/Word1'].drop_duplicates().replace('\s+', np.NAN, regex=True).dropna().tolist()
['E1F25701', 'E15511D7']

And apparently, below will also work:

df['SRC/Word1'].drop_duplicates().str.strip().replace('', np.NAN).dropna().tolist()
['E1F25701', 'E15511D7']

edited Jul 21 '21 at 19:08

answered Jul 21 '21 at 18:53

ThePyGuy

17,779
5
18
45

This output is: [[' '], ['E1F25701'], ['E15511D7']] is there a way to put all the data into a single array? – minTwin Jul 21 '21 at 18:58
1

Can you add the output screenshot to the question along with the solution I have provided? That shouldn't be the output. – ThePyGuy Jul 21 '21 at 18:59
Sure one moment – minTwin Jul 21 '21 at 19:01
1

Okay, I saw it, that is not an empty string but space character, try the last solution in the answer with `regex=True` – ThePyGuy Jul 21 '21 at 19:04
Produces the same output – minTwin Jul 21 '21 at 19:05
`df['SRC/Word1'].drop_duplicates().replace('\s+', np.NAN, regex=True).dropna().tolist()` copy this code and run it on jupyter notebook. – ThePyGuy Jul 21 '21 at 19:05
1

This one should definitely work: `df['SRC/Word1'].drop_duplicates().str.strip().replace('', np.NAN).dropna().tolist()` – ThePyGuy Jul 21 '21 at 19:11

How to skip duplicates and blank values from JSON dataframe and store into an array?

3 Answers3