
I have a dataset with one column and several rows per data item (the number of rows per data item varies). The data items are separated by a line '------------------------------- '.

I want to transpose the data into 3 columns, splitting the data at each '------------------------------- ' line.

Ideally, the first two columns should hold the two ids, and the remaining text rows of each data item (however many there are) should be merged into a single column, i.e. id | id | text.
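For illustration (the ids and text here are made up), one data item looks like this in the input:

12345
67890
first line of comment text
second line of comment text
------------------------------- 

and should become a single output row: 12345 | 67890 | first line of comment text second line of comment text.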

I have tried different approaches suggested on SO but still couldn't get the desired output.

import csv
import sys
inp_fname = 'Comments.csv'
out_fname = 'Columned-Data.csv'

def rez(row, size):
    # Pad a row with empty strings so that every output row has `size` fields.
    rowx = [''] * size
    for i in range(len(row)):
        rowx[i] = row[i]
    return rowx

MATCH = '-------------------------------'
cols = []
glob = []
with open(inp_fname, 'r', newline='') as in_csvfile, open(out_fname, 'w', newline='') as out_csvfile:
    reader = csv.reader(in_csvfile)
    writer = csv.writer(out_csvfile)
    for line in reader:
        # csv.reader yields a list of fields, so compare the (stripped)
        # first field with the separator rather than the whole row.
        if line and line[0].strip() == MATCH:
            glob.append(list(cols))
            cols = []
        else:
            cols.append(line)
    if cols:                     # keep the last data item if the file
        glob.append(list(cols))  # doesn't end with a separator line
    MAX = max(map(lambda x: len(x), glob))

#output = list(map(lambda x: rez(x, MAX), glob))
#writer.writerow(output)
print(list(map(lambda x: rez(x, MAX), glob)))

I need to remove the '------------------------------- ' lines and end up with just 3 fields (id, id, text) for each data item.

Nez
  • possible duplicate of https://stackoverflow.com/a/39358924/8560382 – chrisckwong821 Oct 14 '19 at 07:37
  • pandas doesn't work with this dataset, since it has some delimiters which the code has to remove to make it work, so that approach doesn't apply here. I was looking for a Python solution. – Nez Oct 14 '19 at 09:21
  • would you mind posting a minimal example input/expected output? – chrisckwong821 Oct 14 '19 at 09:29
  • I'm currently trying the updated code. However, there's an issue with matching the line with the MATCH. Can you show me where it's gone wrong? – Nez Oct 14 '19 at 09:49

1 Answer


My source test file is as follows:

r0 xxxx
r1 xxxx, yyy
r2 xxxx, zzz
--------
r3 xxxx
r4 xxxx
--------
r6 xxxx

The first step is to read it with a separator that does not occur in the data (I chose '&'), so that each source line becomes the content of a single field (which I named line):

import pandas as pd

df = pd.read_csv('Input.txt', sep='&', names=['line'])

The next step is to add a "grouping" column whose value increases at each "point of contact" between a line starting with '----' (the end of the previous "logical row") and a following line starting with some other text (the start of the next "logical row"):

df['grp'] = (~df.line.str.startswith('----')
             & df.line.shift(fill_value='').str.startswith('----')).cumsum()

The result, for my data, is:

           line  grp
0       r0 xxxx    0
1  r1 xxxx, yyy    0
2  r2 xxxx, zzz    0
3      --------    0
4       r3 xxxx    1
5       r4 xxxx    1
6      --------    1
7       r6 xxxx    2

The next step is to define a function that, from a group of source rows, generates a Series of strings with the terminating '----' line dropped:

def genRow(gr):
    # Keep only the non-separator lines of the group and renumber them from 0.
    return gr.loc[~gr.line.str.startswith('----'), 'line'].reset_index(drop=True)

The last step (reset_index) was added so that the index of the result always starts from 0. This way, when the results of applying this function are concatenated, consecutive members of each group land in consecutive columns, starting from 0.
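For example, calling genRow by hand on the second group of my sample data (just a quick check, not part of the solution) shows the effect:

print(genRow(df[df.grp == 1]))
# 0    r3 xxxx
# 1    r4 xxxx
# Name: line, dtype: object

Without reset_index the index of this Series would be 4, 5 (the original row numbers) instead of 0, 1.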

And to generate the final result:

  • apply this function to each group,
  • unstack to create a DataFrame,
  • drop index name (rename_axis),
  • replace NaNs with empty strings (fillna).

The code to do it is:

df2 = df.groupby('grp').apply(genRow).unstack(level=1).rename_axis('').fillna('')

This gives:

         0             1             2
0  r0 xxxx  r1 xxxx, yyy  r2 xxxx, zzz
1  r3 xxxx       r4 xxxx              
2  r6 xxxx                            

As you can see:

  • This code works even when some "logical rows" contain fewer lines (and hence fill fewer columns) than others.
  • A line can contain e.g. a comma without being split into adjacent fields.

As the final step, you may set the columns attribute of this DataFrame to give the columns whatever names you wish.
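For example (the names below are just an illustration, and the number of names must match the number of columns in df2), you could name the columns and write the result to a CSV file:

# Illustrative column names: two id columns plus one text column.
df2.columns = ['id1', 'id2', 'text']
# Write the reshaped data out, without the index column.
df2.to_csv('Columned-Data.csv', index=False)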

Valdi_Bo
  • Thank you! I tried the data extraction using JSON parsing and this issue was solved. – Nez Oct 16 '19 at 13:00