
In Python 3 with pandas, I loaded several TXT files. They have no header row and share the same structure: 46 columns, with the same kind of information in each column. Here are three examples:

import pandas as pd

candidatos1 = pd.read_csv("candidatos_2014/consulta_cand_2014_AC.txt", sep=';', header=None, encoding='latin_1')

candidatos1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 621 entries, 0 to 620
Data columns (total 46 columns):
0     621 non-null object
1     621 non-null object
2     621 non-null int64
3     621 non-null int64
4     621 non-null object
5     621 non-null object
6     621 non-null object
7     621 non-null object
8     621 non-null int64
9     621 non-null object
10    621 non-null object
11    621 non-null int64
12    621 non-null int64
13    621 non-null int64
14    621 non-null object
15    621 non-null int64
16    621 non-null object
17    621 non-null int64
18    621 non-null object
19    621 non-null object
20    621 non-null int64
21    621 non-null object
22    621 non-null object
23    621 non-null object
24    621 non-null int64
25    621 non-null object
26    621 non-null object
27    621 non-null int64
28    621 non-null int64
29    621 non-null int64
30    621 non-null object
31    621 non-null int64
32    621 non-null object
33    621 non-null int64
34    621 non-null object
35    621 non-null int64
36    621 non-null object
37    621 non-null int64
38    621 non-null object
39    621 non-null object
40    621 non-null int64
41    621 non-null object
42    621 non-null int64
43    621 non-null int64
44    621 non-null object
45    621 non-null object
dtypes: int64(20), object(26)
memory usage: 223.2+ KB

candidatos2 = pd.read_csv("candidatos_2014/consulta_cand_2014_AL.txt",sep=';', header=None, encoding = 'latin_1') 
candidatos2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 479 entries, 0 to 478
Data columns (total 46 columns):
0     479 non-null object
1     479 non-null object
2     479 non-null int64
3     479 non-null int64
4     479 non-null object
5     479 non-null object
6     479 non-null object
7     479 non-null object
8     479 non-null int64
9     479 non-null object
10    479 non-null object
11    479 non-null int64
12    479 non-null int64
13    479 non-null int64
14    479 non-null object
15    479 non-null int64
16    479 non-null object
17    479 non-null int64
18    479 non-null object
19    479 non-null object
20    479 non-null int64
21    479 non-null object
22    479 non-null object
23    479 non-null object
24    479 non-null int64
25    479 non-null object
26    479 non-null object
27    479 non-null int64
28    479 non-null int64
29    479 non-null int64
30    479 non-null object
31    479 non-null int64
32    479 non-null object
33    479 non-null int64
34    479 non-null object
35    479 non-null int64
36    479 non-null object
37    479 non-null int64
38    479 non-null object
39    479 non-null object
40    479 non-null int64
41    479 non-null object
42    479 non-null int64
43    479 non-null int64
44    479 non-null object
45    479 non-null object
dtypes: int64(20), object(26)
memory usage: 172.2+ KB

candidatos3 = pd.read_csv("candidatos_2014/consulta_cand_2014_AM.txt",sep=';', header=None, encoding = 'latin_1') 
candidatos3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 786 entries, 0 to 785
Data columns (total 46 columns):
0     786 non-null object
1     786 non-null object
2     786 non-null int64
3     786 non-null int64
4     786 non-null object
5     786 non-null object
6     786 non-null object
7     786 non-null object
8     786 non-null int64
9     786 non-null object
10    786 non-null object
11    786 non-null int64
12    786 non-null int64
13    786 non-null int64
14    786 non-null object
15    786 non-null int64
16    786 non-null object
17    786 non-null int64
18    786 non-null object
19    786 non-null object
20    786 non-null int64
21    786 non-null object
22    786 non-null object
23    786 non-null object
24    786 non-null int64
25    786 non-null object
26    786 non-null object
27    786 non-null int64
28    786 non-null int64
29    786 non-null int64
30    786 non-null object
31    786 non-null int64
32    786 non-null object
33    786 non-null int64
34    786 non-null object
35    786 non-null int64
36    786 non-null object
37    786 non-null int64
38    786 non-null object
39    786 non-null object
40    786 non-null int64
41    786 non-null object
42    786 non-null int64
43    786 non-null int64
44    786 non-null object
45    786 non-null object
dtypes: int64(20), object(26)
memory usage: 282.5+ KB

Is there a way to load all of these files at once into a single dataframe?

Or do I need to load them one at a time and then combine the dataframes? If so, how?

Reinaldo Chaves
  • There are two ways to do it: iterate over the file reads and append the data to a list, then convert the list to a dataframe; or iterate over the file reads and append to a dataframe line by line. – tadamhicks Feb 21 '18 at 02:26

2 Answers


In these situations, I like to feed pandas.concat a list comprehension.

from pathlib import Path
import pandas

def _reader(fname):
    return pandas.read_csv(fname, sep=';', header=None, encoding='latin_1')

folder = Path("candidatos_2014")
df = pandas.concat([
    _reader(txt)
    for txt in folder.glob("*.txt")
])
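The same idea can be extended a little. The following is only a sketch, not part of the original answer: the helper name `read_folder` and the extra `source_file` column are assumptions added for illustration. It sorts the paths so the row order is reproducible, records which file each row came from, and passes `ignore_index=True` so the combined dataframe gets one continuous index instead of repeating 0..n for each file. The demo files are throwaway stand-ins for the real `candidatos_2014` folder.

```python
from pathlib import Path
import tempfile

import pandas as pd


def read_folder(folder, pattern="*.txt"):
    """Read every matching file in `folder` and stack the rows (hypothetical helper)."""
    frames = []
    for txt in sorted(Path(folder).glob(pattern)):  # sorted for a stable row order
        frame = pd.read_csv(txt, sep=";", header=None, encoding="latin_1")
        frame["source_file"] = txt.name  # illustrative extra column
        frames.append(frame)
    # ignore_index=True renumbers the rows 0..N-1 across all files
    return pd.concat(frames, ignore_index=True)


# Demo with two small throwaway files instead of the real data
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "AC.txt").write_text("a;1\nb;2\n", encoding="latin_1")
    Path(tmp, "AL.txt").write_text("c;3\n", encoding="latin_1")
    combined = read_folder(tmp)

print(combined)
```

With the real data, the call would presumably be `read_folder("candidatos_2014")`.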
Paul H
  • I hate to even say it... but `pathlib2` for those still using Python 2. That out of the way. Nice answer and use of `Path` (-: – piRSquared Feb 21 '18 at 03:00
  • The thirteenth column is being loaded as int64, but it contains codes with leading zeros. I added the converter below to read_csv, but the column is still imported as int64. – Reinaldo Chaves Feb 21 '18 at 03:16
  • pd.read_csv(fname, sep=';', header=None, encoding='latin_1', converters={13: lambda x: str(x)}) – Reinaldo Chaves Feb 21 '18 at 03:16
  • Is there any other way to keep it as an object? – Reinaldo Chaves Feb 21 '18 at 03:17
  • @ReinaldoChaves I searched stackoverflow for "pandas leading zeros" and found this: https://stackoverflow.com/questions/23836277/add-leading-zeros-to-strings-in-pandas-dataframe (first hit) – Paul H Feb 21 '18 at 14:58
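
One possible explanation for the leading-zeros problem in the comments, offered as a guess rather than a confirmed diagnosis: with `header=None`, pandas labels columns starting from 0, so the thirteenth column has the label 12, not 13. The sketch below uses a made-up two-row sample (not the real data) and the `dtype` argument of `read_csv` to read that one column as text, which preserves the zeros.

```python
import io

import pandas as pd

# Hypothetical sample: 14 fields per row; the thirteenth field (label 12)
# holds codes with leading zeros.
rows = [
    ";".join(["x"] * 12 + ["00123", "9"]),
    ";".join(["y"] * 12 + ["04567", "8"]),
]
sample = io.StringIO("\n".join(rows) + "\n")

# dtype={12: str} reads only column 12 as text; other columns are inferred.
df = pd.read_csv(sample, sep=";", header=None, dtype={12: str})

codes = df[12].tolist()
print(codes)
print(df.dtypes)
```

If the codes really live in a different column, only the key in the `dtype` dictionary needs to change.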

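You can append the dataframes after they are created, like so (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so this only works on older versions):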

    candidatos1.append(candidatos2,ignore_index=True).append(candidatos3,ignore_index=True)
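
On pandas versions without `DataFrame.append`, the same result can be obtained with `pandas.concat`. A minimal sketch, using three tiny made-up dataframes in place of the ones loaded in the question:

```python
import pandas as pd

# Stand-ins for the three dataframes from the question (made-up values)
candidatos1 = pd.DataFrame({0: ["a", "b"], 1: [1, 2]})
candidatos2 = pd.DataFrame({0: ["c"], 1: [3]})
candidatos3 = pd.DataFrame({0: ["d", "e"], 1: [4, 5]})

# One call stacks all three; ignore_index=True renumbers the rows
candidatos = pd.concat(
    [candidatos1, candidatos2, candidatos3],
    ignore_index=True,
)
print(candidatos)
```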

You could also concatenate the text files first and then load the result into pandas, but that step happens outside of pandas.
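
For completeness, here is one way the "concatenate the text first" idea might look in plain Python. This is a sketch under assumptions: the function name `concat_text_then_read` is invented for illustration, and it relies on the files having no header row (as in the question), so their contents can simply be joined. The demo uses throwaway files.

```python
import io
from pathlib import Path
import tempfile

import pandas as pd


def concat_text_then_read(paths, encoding="latin_1"):
    """Join the raw text of several header-less files, then parse once (hypothetical helper)."""
    parts = []
    for path in paths:
        text = Path(path).read_text(encoding=encoding)
        # Make sure each file ends with a newline so rows do not merge
        if text and not text.endswith("\n"):
            text += "\n"
        parts.append(text)
    return pd.read_csv(io.StringIO("".join(parts)), sep=";", header=None)


# Demo with two small throwaway files; the second lacks a trailing newline
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "AC.txt")
    second = Path(tmp, "AL.txt")
    first.write_text("a;1\nb;2", encoding="latin_1")
    second.write_text("c;3\n", encoding="latin_1")
    joined = concat_text_then_read([first, second])

print(joined)
```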

Sachin Myneni