2

i am having a zipfile in that zip file i am having one text with json objects

zip = zipfile.ZipFile('test.zip')
file = zip.read('text1.txt')
print file 

i am getting file type as str

{"id":"1","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]}
{"id":"2","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]}
{"id":"3","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]}
{"id":"4","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]}
{"id":"5","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]}

when I am reading I am getting a string how can I convert this string data frame like

id f data
1  A  2040
1  A  2039
1  A  2038
1  A  2037
1  A  2036
raam
  • 21
  • 6

2 Answers2

0

Use for same size of lists - 4:

print (type(df.loc[1, 'data']))
<class 'list'>

from itertools import chain

#extract first value in each list
data = [[y[0] for y in x] for x in df['data']]
df = pd.DataFrame({
    'id' : df['id'].values.repeat(4),
    'type' : df['type'].values.repeat(4),
    'data' : list(chain.from_iterable(data))
})

More general solution with not equal lists sizes is use str.len:

from itertools import chain

data = [[y[0] for y in x] for x in df['data']]
lens = [len(y) for y in data]
df = pd.DataFrame({
    'id' : df['id'].values.repeat(lens),
    'type' : df['type'].values.repeat(lens),
    'data' : list(chain.from_iterable(data))
})

print (df)

      id type  data
0      1    a  2015
1      1    a  2016
2      1    a  2017
3      1    a  2020
4      2    b  2015
5      2    b  2016
6      2    b  2017
7      2    b  2020
8      3    a  2018
9      3    a  2019
10     3    a  2011
11     3    a  2010
12     4    a  2015
13     4    a  2016
14     4    a  2017
15     4    a  2020
16     5    b  2015
17     5    b  2016
18     5    b  2017
19     5    b  2020
20     6    b  2015
21     6    b  2016
22     6    b  2017
23     6    b  2020
24  5000    a  2015
25  5000    a  2016
26  5000    a  2017
27  5000    a  2020
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

Edited

Sorry for the misunderstanding of the question. I think the conversion of the zip file can be done as follows. I assume that the 1st and last positions of the file are { and } respectively without any newlines symbols.

data = eval(f"[{file.replace('\n', ',')}")

Also being inspired by this post I can offer a solution using pure pandas. If we import the data into a pandas DataFrame, by passing to the constructor a list of the dictionaries above, the problem becomes in how to explode a list within a DataFrame's cell.

This is done by the combination of .apply(pd.Series) that creates multiple columns for the different elements within the list and pd.melt that transforms those columns back into a single column. By previously setting as index the other columns we can save them for the resulting DataFrame.

Here the code:

   # Formatting the data to be introduced in the pd.DataFrame
    data = [{"id":"1","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]},
            {"id":"2","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]},
            {"id":"3","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]},
            {"id":"4","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]},
            {"id":"5","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036", 0]]}
           ]

    # And the piece of code
    (pd.melt((pd.DataFrame(data)
               .assign(data=lambda x: [[l[0] for l in ls] for ls in x.data])
               .set_index(['id', 'f'])
               .data.apply(pd.Series)
               .reset_index()
              ), 
             id_vars=['id', 'f'],
             value_name='data'
             )
     .set_index(['id', 'f'])
     .drop('variable', axis=1)
     .dropna()
     .sort_index()
     .reset_index()
     )

    id  f   data
0   1   A   2040
1   1   A   2039
2   1   A   2038
3   1   A   2037
4   1   A   2036
5   2   A   2040
6   2   A   2039
7   2   A   2038
8   2   A   2037
9   2   A   2036
10  3   A   2040
11  3   A   2039
12  3   A   2038
13  3   A   2037
14  3   A   2036
15  4   A   2040
16  4   A   2039
17  4   A   2038
18  4   A   2037
19  4   A   2036
20  5   A   2040
21  5   A   2039
22  5   A   2038
23  5   A   2037
24  5   A   2036
gonzalo mr
  • 144
  • 12