2

I want to read data from a file into a DataFrame, but the file has a special format. It contains many lines like this:

year = [1, 2, 3]

age = [4, 5, 6]

Here is the link to the file: https://github.com/cuongpiger/Py-for-ML-DS-DV/blob/master/Matplotlib/Chap6_data/dulieu_year_gap_pop_life.txt

Mike
Claire Duong

3 Answers

3

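If you need all the values in one DataFrame, create a dictionary of Series and pass it to the DataFrame constructor, using ast.literal_eval to parse the lists. Wrapping each list in a Series lets pandas pad the shorter columns with NaN: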

import ast

import pandas as pd

d = {}
with open('dulieu_year_gap_pop_life.txt') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Each line looks like "name = [v1, v2, ...]"
        h, data = line.split(' = ', 1)
        d[h] = pd.Series(ast.literal_eval(data))

df = pd.DataFrame(d)
print(df)
     year    pop       gdp_cap  life_exp  life_exp1950
0    1950   2.53    974.580338    43.828         28.80
1    1951   2.57   5937.029526    76.423         55.23
2    1952   2.62   6223.367465    72.301         43.08
3    1953   2.67   4797.231267    42.731         30.02
4    1954   2.71  12779.379640    75.320         62.48
..    ...    ...           ...       ...           ...
146  2096  10.81           NaN       NaN           NaN
147  2097  10.82           NaN       NaN           NaN
148  2098  10.83           NaN       NaN           NaN
149  2099  10.84           NaN       NaN           NaN
150  2100  10.85           NaN       NaN           NaN

[151 rows x 5 columns]

To keep only two columns, pass the columns argument:

df = pd.DataFrame(d, columns=['year','pop'])
print (df)
     year    pop
0    1950   2.53
1    1951   2.57
2    1952   2.62
3    1953   2.67
4    1954   2.71
..    ...    ...
146  2096  10.81
147  2097  10.82
148  2098  10.83
149  2099  10.84
150  2100  10.85

[151 rows x 2 columns]
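
The NaN padding can be checked on a small sample. The following is a minimal sketch, assuming the same `name = [values]` layout; the `sample` text and the `parse_lists` helper are made up for illustration and read from an in-memory string rather than the real file:

```python
import ast
import io

import pandas as pd

# Hypothetical sample in the same "name = [values]" layout; not the real data.
sample = "year = [1, 2, 3]\nage = [4, 5]\n"


def parse_lists(handle):
    """Return a DataFrame with one column per 'name = [values]' line."""
    columns = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, values = line.split(" = ", 1)
        # Wrapping each list in a Series lets pandas pad shorter columns with NaN.
        columns[name] = pd.Series(ast.literal_eval(values))
    return pd.DataFrame(columns)


df = parse_lists(io.StringIO(sample))
print(df)
```

Here `age` has only two values, so its third row should come out as NaN while `year` keeps all three.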
jezrael
1

Since the lists in your input file are not all the same length, you can't assign them directly as columns of one DataFrame. For the first two lists, which are the same length, the following would work:

import pandas as pd
import requests

url = 'https://raw.githubusercontent.com/cuongpiger/Py-for-ML-DS-DV/master/Matplotlib/Chap6_data/dulieu_year_gap_pop_life.txt'
response = requests.get(url)
a = response.content.decode('utf-8')
df = pd.DataFrame()
# Only the first two lines (year and pop) have the same length
for i in a.splitlines()[:2]:
    # Tokens look like: name, '=', '[v1,', 'v2,', ..., 'vn]'; the values stay strings
    df[i.split()[0]] = [x.replace(']','').replace('[','').replace(',','') for x in i.split()[2:]]

df
Out: 
     year    pop
0    1950   2.53
1    1951   2.57
2    1952   2.62
3    1953   2.67
4    1954   2.71
..    ...    ...
146  2096  10.81
147  2097  10.82
148  2098  10.83
149  2099  10.84
150  2100  10.85
[151 rows x 2 columns]
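
Note that splitting on whitespace and stripping brackets leaves every value as a string. A minimal sketch of converting such a column to numbers with `pd.to_numeric`, using a made-up line in the same layout (the `line` text is an assumption, not the real data):

```python
import pandas as pd

# Hypothetical line in the same layout as the file; not the real data.
line = "pop = [2.53, 2.57, 2.62]"

tokens = line.split()
name = tokens[0]
# Same token cleanup as in the loop above: strip brackets and commas.
raw = [t.replace("[", "").replace("]", "").replace(",", "") for t in tokens[2:]]

df = pd.DataFrame({name: raw})
before = pd.api.types.is_numeric_dtype(df[name])  # False: values are still strings

df[name] = pd.to_numeric(df[name])
after = pd.api.types.is_numeric_dtype(df[name])   # True once converted

print(before, after)
```

Without this step, arithmetic and plotting on the columns would operate on text rather than numbers.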
luigigi
0

Using a regular expression:

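import ast
import re

import pandas as pd

# Empty DataFrame
df = pd.DataFrame()

with open('dulieu_year_gap_pop_life.txt', 'r') as file:
    for line in file:
        group = re.match(r'(.*) = (.*)', line)
        if group is None:
            continue  # skip lines that are not "name = [values]"
        # literal_eval parses the list without executing arbitrary code
        df[group[1]] = pd.Series(ast.literal_eval(group[2]))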
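
When parsing the right-hand side of each line, `ast.literal_eval` is generally a safer choice than `eval` for text that comes from a file, because it only accepts Python literals. A minimal sketch of the difference (the input strings are made up for illustration):

```python
import ast

# A list literal parses to a Python list.
values = ast.literal_eval("[1, 2, 3]")
print(values)

# Anything that is not a plain literal (here, a function call) is rejected
# instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```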
GIRISH kuniyal