How to read a .txt file containing an array?

Question

I want to read data from a file into a DataFrame, but the file is in an unusual format. It contains many lines like this:

year = [1, 2, 3]

age = [4, 5, 6]

Here is the link to the file: https://github.com/cuongpiger/Py-for-ML-DS-DV/blob/master/Matplotlib/Chap6_data/dulieu_year_gap_pop_life.txt

If it's for a training exercise, are you sure you aren't supposed to just copy and paste it into your code? — David Buck, Nov 01 '19 at 07:27
https://stackoverflow.com/a/43602645/2988730. Shameless plug, but perfectly suited here. — Mad Physicist, Nov 01 '19 at 07:31
The lines are not the same length, so you can't really put them into a DataFrame. Their lengths are `151, 151, 142, 142, 142`. — luigigi, Nov 01 '19 at 07:37

score 3 · Accepted Answer · answered Nov 01 '19 at 08:04

To load all the values into a DataFrame, create a dictionary of Series and pass it to the DataFrame constructor, using ast.literal_eval to parse the lists:

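import ast
import pandas as pd

d = {}
with open('dulieu_year_gap_pop_life.txt') as file:
    for line in file:
        # Each line looks like "name = [v1, v2, ...]"
        name, data = line.strip().split(' = ')
        # One Series per column lets pandas pad the shorter lists with NaN
        d[name] = pd.Series(ast.literal_eval(data))

df = pd.DataFrame(d)
print(df)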
     year    pop       gdp_cap  life_exp  life_exp1950
0    1950   2.53    974.580338    43.828         28.80
1    1951   2.57   5937.029526    76.423         55.23
2    1952   2.62   6223.367465    72.301         43.08
3    1953   2.67   4797.231267    42.731         30.02
4    1954   2.71  12779.379640    75.320         62.48
..    ...    ...           ...       ...           ...
146  2096  10.81           NaN       NaN           NaN
147  2097  10.82           NaN       NaN           NaN
148  2098  10.83           NaN       NaN           NaN
149  2099  10.84           NaN       NaN           NaN
150  2100  10.85           NaN       NaN           NaN

[151 rows x 5 columns]

For only 2 columns use:

df = pd.DataFrame(d, columns=['year','pop'])
print (df)
     year    pop
0    1950   2.53
1    1951   2.57
2    1952   2.62
3    1953   2.67
4    1954   2.71
..    ...    ...
146  2096  10.81
147  2097  10.82
148  2098  10.83
149  2099  10.84
150  2100  10.85

[151 rows x 2 columns]
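
The dictionary holds Series rather than plain lists for a reason: the DataFrame constructor rejects a dict of lists with unequal lengths, whereas Series are aligned on their index and the shorter ones are padded with NaN. A minimal sketch of that difference, using made-up data (the names and values below are illustrative and not taken from the file):

```python
import pandas as pd

# Hypothetical sample data with unequal lengths.
raw = {"year": [1950, 1951, 1952], "gdp_cap": [974.58, 5937.03]}

# Plain lists of different lengths are rejected by the constructor.
try:
    pd.DataFrame(raw)
    lists_ok = True
except ValueError:
    lists_ok = False

# Series are aligned on their index; missing positions become NaN.
df = pd.DataFrame({name: pd.Series(values) for name, values in raw.items()})
```

Here `lists_ok` ends up False, while `df` has three rows with NaN in the last `gdp_cap` cell.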

luigigi · Answer 2 · 2019-11-01T08:14:11.913

Since the lists in your input file aren't all the same length, you can't assign them directly as columns of one DataFrame. For the first two lists, which are the same length, the following would work:

import pandas as pd
import requests

url = 'https://raw.githubusercontent.com/cuongpiger/Py-for-ML-DS-DV/master/Matplotlib/Chap6_data/dulieu_year_gap_pop_life.txt'
response = requests.get(url)
a = response.content.decode('utf-8')
df = pd.DataFrame()
for i in a.splitlines()[:2]:
    df[i.split()[0]] = [x.replace(']','').replace('[','').replace(',','') for x in i.split()[2:]]

df
Out: 
     year    pop
0    1950   2.53
1    1951   2.57
2    1952   2.62
3    1953   2.67
4    1954   2.71
..    ...    ...
146  2096  10.81
147  2097  10.82
148  2098  10.83
149  2099  10.84
150  2100  10.85
[151 rows x 2 columns]
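
One thing to watch with this string-splitting approach: the cells it produces are strings, not numbers, so arithmetic and plotting will not behave as expected until they are converted. A minimal sketch of the conversion, using two short made-up lines in place of the downloaded text:

```python
import pandas as pd

# Hypothetical stand-in for the first two lines of the downloaded text.
text = "year = [1950, 1951, 1952]\npop = [2.53, 2.57, 2.62]"

df = pd.DataFrame()
for line in text.splitlines()[:2]:
    parts = line.split()
    # parts[0] is the name, parts[1] is '=', the rest are the list items.
    df[parts[0]] = [p.strip("[],") for p in parts[2:]]

# Every column holds strings at this point; convert each to a numeric dtype.
df = df.apply(pd.to_numeric)
```

After the conversion, `year` is an integer column and `pop` a float column.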

score 0 · Answer 3 · answered Nov 01 '19 at 10:16

With the help of a regex:

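import ast
import re

import pandas as pd

# Start from an empty DataFrame and add one column per line
df = pd.DataFrame()

with open('dulieu_year_gap_pop_life.txt', 'r') as file:
    for line in file:
        # group[1] is the column name, group[2] the list literal
        group = re.match(r'(.*) = (.*)', line)
        # literal_eval parses the list without executing arbitrary code
        df[group[1]] = pd.Series(ast.literal_eval(group[2]))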
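
If the file may contain blank lines, as the excerpt in the question suggests, a more defensive parser can be sketched with the standard library alone. The helper name `parse_assignments` and the sample text are hypothetical, chosen only for illustration:

```python
import ast
import io

def parse_assignments(lines):
    """Parse lines of the form 'name = [v1, v2, ...]' into a dict of lists."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between assignments
        name, _, value = line.partition("=")
        # literal_eval accepts only Python literals, so nothing is executed.
        result[name.strip()] = ast.literal_eval(value.strip())
    return result

# Hypothetical sample mirroring the layout shown in the question.
sample = io.StringIO("year = [1, 2, 3]\n\nage = [4, 5, 6]\n")
data = parse_assignments(sample)
lengths = {name: len(values) for name, values in data.items()}
```

Checking `lengths` before building the DataFrame shows which columns are shorter and would be padded with NaN.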
