
I'm trying to build a number of dataframes from data whose set of variables (not just values) can change from row to row within the same dataframe.

The way I'm doing it now is to build a new 1-row dataframe for each new row and then append it to the existing dataframe with the append method. This takes care of creating new columns and setting the values to NaN for the existing rows.
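
Schematically, what I do now looks like this (a sketch, where parsed_rows stands in for the per-line dicts I build):

import pandas as pd

df = pd.DataFrame()
for row_dict in parsed_rows:                  # parsed_rows: placeholder for my per-line dicts
    df = df.append(pd.DataFrame([row_dict]))  # copies the whole frame on every iteration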

I also tried the loc method as suggested here, but this returns a ValueError.

In addition, I could have more than a thousand possible labels, so I would like to avoid declaring all the columns explicitly at the beginning; it's almost impossible to know all the columns a particular file will need without reading the whole file at least once.

I know, however, that building a dataframe line-by-line is considered a bad (if not deprecated) practice.

So, let's say my data comes from a text file somewhat similar to this:

A=10,B=2
A=20,B=3
A=30,C=Batman

and I want to create a dataframe that looks like

    a    b       c
0  10  2.0     NaN
1  20  3.0     NaN
2  30  NaN  Batman

How would you suggest to do it?

EDIT: The data comes from a very messy fixed-width text file. Each line of the file is a continuous sequence of chars (no delimiter). Inside the line, a 3-letter identifier marks the beginning of a section and is followed by the values for that section, all run together. I have a document, which I translated into a Python dict, that tells me for each identifier how many chars I need to read after the beginning of the section and how they are divided.

E.g. one line could be

AAA1234BBB789aa78CCC123456

I would then know that section AAA is followed by 3 values: one 2-digit int and two 1-digit ints. Section BBB is followed by a 3-digit int, a 2-char string and two 1-digit ints.

I have a piece of code that translates this into a dict that looks like

{'AAA_1': 12, 'AAA_2': 3, 'AAA_3': 4, 'BBB_1': 789, 'BBB_2': 'aa', 'BBB_3': 7, 'BBB_4': 8, ......}
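
Schematically, the parsing looks like this (a simplified sketch; the SPEC dict below just encodes the example line above, not my real format):

# Hypothetical spec: for each 3-letter section identifier, the widths of the fields that follow it.
SPEC = {'AAA': [2, 1, 1], 'BBB': [3, 2, 1, 1], 'CCC': [6]}

def parse_line(line, spec):
    """Turn one fixed-width line into a flat dict like {'AAA_1': '12', ...}."""
    row, i = {}, 0
    while i < len(line):
        section = line[i:i + 3]  # 3-letter section identifier
        i += 3
        for n, width in enumerate(spec[section], start=1):
            row['%s_%d' % (section, n)] = line[i:i + width]  # values stay strings here
            i += width
    return row

print(parse_line('AAA1234BBB789aa78CCC123456', SPEC))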

EDIT2: If you want to have a glimpse of an original file, you can look here (any of them will work):

ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2017/

And to understand how to read it, look here (I didn't want to ask too much of you):

ftp://ftp.ncdc.noaa.gov/pub/data/noaa/ish-format-document.pdf

Luca

3 Answers


You can make each row a dictionary and then combine them into a dataframe:

dic1 = {'a':10,'b':2}
dic2 = {'a':20,'b':3}
dic3 = {'a':30,'c':'batman'}

pd.DataFrame(data=[dic1,dic2,dic3])
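
The same idea scales to a whole file: collect the per-line dicts in a list and build the frame once at the end (a sketch, assuming lines like A=10,B=2 in a placeholder file data.txt):

import pandas as pd

rows = []
with open('data.txt') as f:                  # placeholder path
    for line in f:
        rows.append(dict(pair.split('=') for pair in line.strip().split(',')))

df = pd.DataFrame(rows)                      # keys missing on a line become NaN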
Ashok KS
  • As simple as it is, this ended up being the most practical solution, and for ~75'000 lines it takes only 1-2 seconds. – Luca Oct 26 '18 at 22:19

It looks like your data is more suitably stored as JSON. If you can convert your data into JSON as follows:

Contents of data.json:

[
   {"A":10, "B":2},
   {"A":20, "B":3},
   {"A":30, "C":"Batman"}
]

Then you can simply do:

>>> df = pd.read_json('data.json')
>>> print(df)
    A    B       C
0  10  2.0     NaN
1  20  3.0     NaN
2  30  NaN  Batman
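
If the rows are already parsed into Python dicts, producing that intermediate file is a single json.dump call (a sketch with the example rows hard-coded):

import json

rows = [{"A": 10, "B": 2}, {"A": 20, "B": 3}, {"A": 30, "C": "Batman"}]
with open('data.json', 'w') as f:
    json.dump(rows, f, indent=2)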
jhansen
  • The data comes from a very messy fixed-width ASCII file. I don't think there's anything off-the-shelf to convert it to JSON. I know almost nothing about JSON; would converting the data from text to JSON and then from JSON to pandas be more efficient than Ashok KS's answer? Would you suggest going down this road? – Luca Oct 26 '18 at 13:24
  • Well it looks like your data is already almost in JSON. All you would need to do is: (1) replace '=' with ':', (2) put quotes around strings, (3) put braces around each row, and (4) put square brackets around the whole thing. How easy that is depends on exactly how messy your data is... – jhansen Oct 26 '18 at 13:27
  • That's not my data format; I used it to make the question understandable. My data is something like `AAA1BBB123CCC123am`. I need to know, for each section identifier (`AAA`, `BBB`, `CCC` in this example), how many chars follow and how they are divided (e.g. whether after BBB I have 123, or 1, 2 and 3, or 12 and 3, etc.). For that I have a dict, but I was looking to convert the rows to pandas afterwards. Sorry for the misleading example, but I thought that the original format was too complicated and not really related to the question. – Luca Oct 26 '18 at 13:41
  • In that case it looks like the main issue is the (interesting) data format, so it is very relevant to the question indeed. If it is not a standard format it is unlikely that there will be an elegant piece of code to read it. Maybe if you give a larger sample of your actual data we can have another look. – jhansen Oct 26 '18 at 13:49

Assuming the data comes as CSV files, you can read each one like:

data1 = pd.read_csv('file1.csv', header=0)
data2 = pd.read_csv('file2.csv', header=0)

>>> print data1
     a       b    c
0   10       2  5.0
1    0     NaN  8.0
2  NaN  batman  9.0
3  cat     NaN  NaN

>>> print data2
   a  b     d
0  0  0   123
1  0  0   fox
2  0  0   883
3  0  0  bats

then:

joined = pd.concat([data1, data2], sort=True).reset_index(drop=True)

>>> print joined
     a       b    c     d
0   10       2  5.0   NaN
1    0     NaN  8.0   NaN
2  NaN  batman  9.0   NaN
3  cat     NaN  NaN   NaN
4    0       0  NaN   123
5    0       0  NaN   fox
6    0       0  NaN   883
7    0       0  NaN  bats
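
The same pattern is the efficient way to combine many pieces in general: collect them in a list and call pd.concat once, rather than concatenating inside a loop (a sketch; parsed_rows stands in for the per-line dicts):

import pandas as pd

# one small frame per row, a single concatenation at the end
frames = [pd.DataFrame([row_dict]) for row_dict in parsed_rows]
df = pd.concat(frames, ignore_index=True, sort=True)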
Skandix
  • I'm sorry, but the original data is not CSV. I added a better explanation. – Luca Oct 26 '18 at 13:50
  • I see that from the pdf. With the code you have already written to convert each line into a dictionary, you could read a single line as one dataframe and, for each consecutive line, repeat creating dataframes and concatenating them to the previous one. This could be one solution. – Manoj M Oct 26 '18 at 14:54
  • That's exactly what I'm doing now, but repeated concatenation in pandas is extremely inefficient, as also noted in the documentation. I was therefore looking for an alternative. Thanks for the idea anyway. – Luca Oct 26 '18 at 15:05