0

The following code displays data from a JSON Line file.

import pandas as pd
import numpy


start = time.time()
with open('stela_zerrl_t01_201222_084053_test_edited.json', 'r') as fin:
    df = pd.read_json(fin, lines=True)
    parsed_data = df[["SRC/Word1"]].drop_duplicates().replace('', np.NAN).dropna().values.tolist()
    
print(parsed_data)



The output is:

[[' '], ['E1F25701'], ['E15511D7']]

Is there a way remove the blank data, duplicates, and store it as an array?

enter image description here

minTwin
  • 1,181
  • 2
  • 21
  • 35
  • 1
    [drop_duplicates](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) to remove duplicates, [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing) to filter out blank rows and [to_numpy](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html) to convert to a numpy array – It_is_Chris Jul 21 '21 at 18:21

3 Answers3

1

You can use .dropna(),.drop_duplicates().

parsed_data=parsed_data.drop.duplicates()
parsed_data.dropna(how='all', inplace = True) 
# do operation inplace your dataframe and return None.

parsed_data= parsed_data.dropna(how='all') 
# don't do operation inplace and return a dataframe as a result. 
# Hence this result must be assigned to your dataframe

If directly .dropna() not working for you then you may use .replace('', numpy.NaN,inplace=True).

Or you can try this too:

json_df=json_df[json_df['SRC/Word1'].str.strip().astype(bool)]

This is faster then doing .replace(), if you have empty string.

And now we cleaned it, we can just use .values.tolist() to get those value in list.

imxitiz
  • 3,920
  • 3
  • 9
  • 33
  • dropna doesnt work says it doesn't exist. The replace method have the output blank. – minTwin Jul 21 '21 at 18:38
  • 1
    @minTwin I had already told you, `inplace=True` doesn't return any values! – imxitiz Jul 21 '21 at 18:40
  • I get the following output: [[' '] ['E1F25701'] ['E15511D7']] – minTwin Jul 21 '21 at 18:46
  • parsed_data = json_df[["SRC/Word1"]].drop_duplicates().dropna(how = 'all').to_numpy() Produces the result. The data all has to be in a single array. – minTwin Jul 21 '21 at 18:47
  • There is an error: 'DataFrame' object has no attribute 'to_list' if I change to_numpy to to_list() – minTwin Jul 21 '21 at 18:51
  • It is outputing multiple arrays. Is there a way to make all the data be stored in a single array? – minTwin Jul 21 '21 at 18:53
  • What does that mean? Example please! – imxitiz Jul 21 '21 at 18:54
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/235159/discussion-between-mintwin-and-xitiz). – minTwin Jul 21 '21 at 18:55
  • 1
    @minTwin do this `json_df=json_df[json_df['SRC/Word1'].str.strip().astype(bool)]` – imxitiz Jul 21 '21 at 19:04
  • @minTwin According to [this](https://stackoverflow.com/a/56708633) doing `.astype()` is faster then any other. – imxitiz Jul 21 '21 at 19:09
  • @minTwin you are telling that you get many array, does this solve your that problem? `prased_data=[[' '], ['E1F25701'], ['E15511D7']] print([a[0] for a in prased_data]) ` – imxitiz Jul 21 '21 at 19:26
1

Yup! Pandas has built-in functions for all of these operations:

import pandas as pd
df = pd.read_json('stela_zerrl_t01_201222_084053_test_edited.json', lines=True)
series = df['SRC/Word1']
no_dupes = series.drop_duplicates()
no_blanks = no_dupes.dropna()
final_list = no_blanks.tolist()

If you want an numpy array rather than a python list, you can change the last line to the following:

final_array = no_blanks.to_numpy()
bpgeck
  • 1,592
  • 1
  • 14
  • 32
  • The following gets an error: ValueError: Trailing data – minTwin Jul 21 '21 at 18:34
  • 1
    @minTwin Ah so this occurs when your JSON file has more than one JSON block in it. A separate issue from this post: https://stackoverflow.com/questions/30088006/loading-a-file-with-more-than-one-line-of-json-into-pandas – bpgeck Jul 21 '21 at 18:42
  • @minTwin Fixed my answer to use `lines=True` to prevent that error – bpgeck Jul 21 '21 at 19:41
1

Drop the duplicates, replace empty string by NaN, then create the list.

>>> df.drop_duplicates().replace('', np.NAN).dropna().values.tolist()
[['E1F25701'], ['E15511D7']]

PS: Since what you have is a dataframe so it will be a 2D List, if you want 1D list, you may do it for the specific column:

>>> df['SRC/Word1'].drop_duplicates().replace('', np.NAN).dropna().tolist()
['E1F25701', 'E15511D7']

What you have is not an empty string, but white space character, Try this: replace \s+ to np.NAN with regex=True:

>>>df['SRC/Word1'].drop_duplicates().replace('\s+', np.NAN, regex=True).dropna().tolist()
['E1F25701', 'E15511D7']

And apparently, below will also work:

df['SRC/Word1'].drop_duplicates().str.strip().replace('', np.NAN).dropna().tolist()
['E1F25701', 'E15511D7']
ThePyGuy
  • 17,779
  • 5
  • 18
  • 45