
I have a .txt file structured as below:

#n  1
a 1:0.0002 3:0.0003...
#n  2
b 2:0.0002 3:0.0003...
#n  3
a 1:0.0002 2:0.0003... 
...

I am trying to parse it into a dataframe of the following structure:

#    type  1        2       3 
1    a     0.0002   null    0.0003 ....
2    b     null     0.0002  0.0003 ....
3    a     0.0002   0.0003  null   ....
...

The rule being:

# i - 'i' is the row number
n:data - 'n' is the number of the column to fill, 'data' is the value to put into the i'th row of that column

If the number of columns were small enough this could be done manually, but the txt file in question has roughly 2000-3000 column values, and some of them are missing.

import pandas as pd
data = pd.read_csv("filename.txt", sep = "#", header = None)

gives a two-column dataframe (the 'type n:value ...' part of each pair lands in column 0, the 'n  i' part in column 1), which I then split into odd and even rows:

data1 = data.iloc[1::2]
data2 = data.iloc[::2]

I tried to remove the odd rows in data1 and the even rows in data2, hoping to then figure out how to split the remaining text and merge the two dataframes, roughly as sketched below. But there might be a faster and more elegant method, which is why I am asking here.
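Roughly the shape I was going for (a sketch only; it assumes read_csv leaves the 'type n:value ...' text in column 0 and the 'n  i' part in column 1, as above):

import pandas as pd

data = pd.read_csv("filename.txt", sep="#", header=None)
data1 = data.iloc[1::2].reset_index(drop=True)  # the 'type n:value ...' rows
data2 = data.iloc[::2].reset_index(drop=True)   # the '#n  i' rows
# line each row number up next to its data in a single frame
merged = pd.concat([data2[1], data1[0]], axis=1)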

Update: I spent 3 hours figuring out how to work with dataframes, as I was not that familiar with them. Starting from the result above and using

import pandas as pd

df = pd.read_csv("myfile.txt", sep="#", header=None)
# even rows hold '#n\t<i>': keep just the integer row number
# (writing via df.at, since modifying the row copies from iterrows
# does not propagate back to the frame)
for index in range(0, len(df), 2):
    df.at[index, 1] = int(df.at[index, 1].split('\t')[1])
# odd rows hold 'type n:value ...': split them into token lists
for index in range(1, len(df), 2):
    df.at[index, 0] = df.at[index, 0].split(' ')
# move each token list up next to its row number, keep every other row
df[0] = df[0].shift(-1)
df = df.iloc[::2]
# swap the columns so the row number comes first, then renumber
df = df[[1, 0]]
df = df.rename(columns={0: 1, 1: 0})
df.index = range(len(df))

it became a dataframe with the row number in one column and the split token list (the type plus the 'n:value' strings) in the other.

Any suggestions on how to add an unknown number of phantom columns and fill them using the "n:value" entries from the list, putting "value" into column "n"?

– 2Napasa

2 Answers


I think you are better off parsing the file yourself rather than relying on read_csv and then dealing with the mess. Here is how I would do it. Since I do not have access to your real file, I am using the small example from your question. First, load the file:

from io import StringIO
file = StringIO(
"""\
#n  1
a 1:0.0002 3:0.0003
#n  2
b 2:0.0002 3:0.0003
#n  3
a 1:0.0002 2:0.0003
""")
# You would just use file = open('myfile.txt', 'r') instead of the above

Then we read all the lines, group them in pairs, parse each pair, and stick the results into a dict:

# read all lines
lines = file.readlines()

# here we will store the results, dictionary of dictionaries
parsing_res = {}

# a fancy way of processing two lines, odd and even, at the same time
for line1,line2 in zip(lines[::2],lines[1::2]):
    # line1 has the form '#n  1'; we split on whitespace and take the second token
    row_index = line1.split()[1]
    # line2 is the other type of lines, split into tokens by whitespace
    tokens = line2.split()
    # first one is 'type'
    t = tokens[0]

    # the others are pairs 'x:y', split them into x,y and stick into a dictionary with label x and value y
    row_dict = {token.split(':')[0]:token.split(':')[1] for token in tokens[1:]}

    # add type
    row_dict['type'] = t
   
    # store the result for these two lines into the main dictionary
    parsing_res[row_index] = row_dict
parsing_res

Now we have something that looks like this:

{'1': {'1': '0.0002', '3': '0.0003', 'type': 'a'},
 '2': {'2': '0.0002', '3': '0.0003', 'type': 'b'},
 '3': {'1': '0.0002', '2': '0.0003', 'type': 'a'}}

This dict can now be used directly to create a dataframe, which we proceed to do; we also sort the rows and columns, since they come out in a somewhat arbitrary order.

df = pd.DataFrame.from_dict(parsing_res, orient='index')
df = df.reindex(sorted(df.columns), axis=1).reindex(sorted(df.index), axis=0)
df

output

    1       2       3       type
1   0.0002  NaN     0.0003  a
2   NaN     0.0002  0.0003  b
3   0.0002  0.0003  NaN     a
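
Note that all the parsed values are still strings at this point. If you want the value columns as floats, a small follow-up (assuming 'type' is the only non-numeric column) would be:

num_cols = [c for c in df.columns if c != 'type']
df[num_cols] = df[num_cols].astype(float)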
– piterbarg
  • `row_index`, `tokens`, `t` were out of range; used try/except, all working now! Takes 23 GB of RAM total for my data :) same amount as my 40-line notebook, but now it is nice, tidy and clean! – 2Napasa Dec 08 '20 at 16:08
  • Thanks! @piterbarg any suggestions on how to clear the RAM? – 2Napasa Dec 08 '20 at 16:08
  • @DvaNapasa I am not sure what you mean by 'clear the RAM'. Once you are done parsing you can remove intermediate results with, for example, `del lines` and `del parsing_res`. But I am not sure that is your issue. You can also avoid loading your whole file into memory in one go: replace `lines = file.readlines()` with reading a few lines at a time, creating a df from those lines, appending it to your new csv, and so on; a sketch of that idea follows below. – piterbarg Dec 08 '20 at 16:46
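
A rough sketch of that incremental idea (the helper function, chunk size, and file names here are assumptions, not part of the original answer):

from itertools import islice
import pandas as pd

def parse_pair(line1, line2):
    # same per-pair parsing as the loop in the answer above
    row_index = line1.split()[1]
    tokens = line2.split()
    row_dict = {tok.split(':')[0]: tok.split(':')[1] for tok in tokens[1:]}
    row_dict['type'] = tokens[0]
    return row_index, row_dict

pairs_per_chunk = 1000  # assumed chunk size; tune to your memory budget
first = True
with open('myfile.txt') as f:
    while True:
        chunk = list(islice(f, 2 * pairs_per_chunk))
        if not chunk:
            break
        res = dict(parse_pair(l1, l2) for l1, l2 in zip(chunk[::2], chunk[1::2]))
        # caveat: each chunk may see a different set of 'n' columns, so fix
        # the column order (e.g. reindex against a known column list) before
        # appending, or the csv rows will not line up
        pd.DataFrame.from_dict(res, orient='index').to_csv(
            'parsed.csv', mode='w' if first else 'a', header=first)
        first = False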

Data:

import pandas as pd

lines = [
    "a 1:0.0002 3:0.0003",
    "b 2:0.0002 3:0.0003",
    "a 1:0.0002 2:0.0003"]
l = pd.Series(lines)
# the first token of each line is the type
type_ = l.apply(lambda x: x.split()[0]).to_list()
# the remaining 'n:value' tokens become one {n: value} dict per line
df = (pd.DataFrame(
    l.apply(lambda x: {i.split(":")[0]: i.split(":")[1] for i in x.split()[1:]}).to_dict()
).T).assign(type_=type_).iloc[:, [3, 0, 1, 2]]  # move type_ to the front

df:

  type_       1       3       2
0     a  0.0002  0.0003     NaN
1     b     NaN  0.0003  0.0002
2     a  0.0002     NaN  0.0003
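
The value columns come out in first-seen order (1, 3, 2 here) rather than sorted. If that matters, one way to order them numerically after type_ is:

df = df[['type_'] + sorted((c for c in df.columns if c != 'type_'), key=int)]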
– Amir saleem