create a dataframe from str with python pandas

Question

i am having a zipfile in that zip file i am having one text with json objects

zip = zipfile.ZipFile('test.zip')
file = zip.read('text1.txt')
print file

i am getting file type as str

{"id":"1","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]}
{"id":"2","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]}
{"id":"3","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]}
{"id":"4","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]}
{"id":"5","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]}

when I am reading I am getting a string how can I convert this string data frame like

@ jezrael can you see my edited question and give me the solution — raam, Jul 23 '18 at 14:33
I think need `import ast` and `df['data'] = df['data'].apply(ast.literal_eval)` before my solution. — jezrael, Jul 23 '18 at 15:01
@ jezrael can you find my new question Find that start date and end dates are available using python pandas — raam, Aug 08 '18 at 12:31

score 0 · Accepted Answer · answered Jul 23 '18 at 10:52

Use for same size of lists - 4:

print (type(df.loc[1, 'data']))
<class 'list'>

from itertools import chain

#extract first value in each list
data = [[y[0] for y in x] for x in df['data']]
df = pd.DataFrame({
    'id' : df['id'].values.repeat(4),
    'type' : df['type'].values.repeat(4),
    'data' : list(chain.from_iterable(data))
})

More general solution with not equal lists sizes is use str.len:

from itertools import chain

data = [[y[0] for y in x] for x in df['data']]
lens = [len(y) for y in data]
df = pd.DataFrame({
    'id' : df['id'].values.repeat(lens),
    'type' : df['type'].values.repeat(lens),
    'data' : list(chain.from_iterable(data))
})

print (df)

      id type  data
0      1    a  2015
1      1    a  2016
2      1    a  2017
3      1    a  2020
4      2    b  2015
5      2    b  2016
6      2    b  2017
7      2    b  2020
8      3    a  2018
9      3    a  2019
10     3    a  2011
11     3    a  2010
12     4    a  2015
13     4    a  2016
14     4    a  2017
15     4    a  2020
16     5    b  2015
17     5    b  2016
18     5    b  2017
19     5    b  2020
20     6    b  2015
21     6    b  2016
22     6    b  2017
23     6    b  2020
24  5000    a  2015
25  5000    a  2016
26  5000    a  2017
27  5000    a  2020

gonzalo mr · Answer 2 · 2018-07-24T16:46:22.083

Edited

Sorry for the misunderstanding of the question. I think the conversion of the zip file can be done as follows. I assume that the 1st and last positions of the file are { and } respectively without any newlines symbols.

data = eval(f"[{file.replace('\n', ',')}")

Also being inspired by this post I can offer a solution using pure pandas. If we import the data into a pandas DataFrame, by passing to the constructor a list of the dictionaries above, the problem becomes in how to explode a list within a DataFrame's cell.

This is done by the combination of .apply(pd.Series) that creates multiple columns for the different elements within the list and pd.melt that transforms those columns back into a single column. By previously setting as index the other columns we can save them for the resulting DataFrame.

Here the code:

   # Formatting the data to be introduced in the pd.DataFrame
    data = [{"id":"1","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]},
            {"id":"2","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]},
            {"id":"3","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]},
            {"id":"4","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036",0]]},
            {"id":"5","f":"A","data":[["2040",0],["2039",0],["2038",0],["2037",0],["2036", 0]]}
           ]

    # And the piece of code
    (pd.melt((pd.DataFrame(data)
               .assign(data=lambda x: [[l[0] for l in ls] for ls in x.data])
               .set_index(['id', 'f'])
               .data.apply(pd.Series)
               .reset_index()
              ), 
             id_vars=['id', 'f'],
             value_name='data'
             )
     .set_index(['id', 'f'])
     .drop('variable', axis=1)
     .dropna()
     .sort_index()
     .reset_index()
     )

    id  f   data
0   1   A   2040
1   1   A   2039
2   1   A   2038
3   1   A   2037
4   1   A   2036
5   2   A   2040
6   2   A   2039
7   2   A   2038
8   2   A   2037
9   2   A   2036
10  3   A   2040
11  3   A   2039
12  3   A   2038
13  3   A   2037
14  3   A   2036
15  4   A   2040
16  4   A   2039
17  4   A   2038
18  4   A   2037
19  4   A   2036
20  5   A   2040
21  5   A   2039
22  5   A   2038
23  5   A   2037
24  5   A   2036

how can i convert string to dataframe – raam Jul 24 '18 at 03:52 — raam, Jul 24 '18 at 03:52

create a dataframe from str with python pandas

2 Answers2