1

I have a csv with unstructured information. I want to use pivot_table (or merge ?) from pandas to have only one rows for every instance.task_id.number and to spread out metric question in several colums.

For example if i have 4 instance.task_id_number, i need to have 4 columns of metric.question I tried with pivot and pivot.table and merge but nothing match my expectation.

Thanks for your help !

#Edit : as asked, i did it as a example :

What i have :

df = pd.DataFrame([["A", 2], ["A", 3], ["A", 6], ["B", 10], ["B", 11], ["B", 12]])

what i want :

df2 = pd.DataFrame([["A", 2, 3, 6], ["B", 10, 11, 12]])

#Edit 2 : What i tried with pivot_table with the real dataframe. I put aggfunc with "metric.question and drop it in values.

I got the error :

AttributeError: 'SeriesGroupBy' object has no attribute 'index'.

I tried to reset the index but it doesn't work better. The code :

import pandas as pd

stockage = pd.read_csv(r"C:\Users\vion1\Ele\Engie\Import_Engie\asmt_assessment_instance_question.csv", encoding="cp1252")
df = pd.DataFrame(stockage)
#df = df.filter(["instance.task_id.number", "metric.question"], axis = 1)


df2 = df.reset_index(drop = True).pivot_table(index=['instance.task_id.number'],
             columns='metric.question',
             values=["instance","instance.trigger_id","instance.task_id.number","instance.taken_on","instance.state",
             "string_value","metric.order","value","sys_updated_on","instance.task_id.company",
             "instance.user.u_company_customer.u_customer_trigram","instance.task_id.contact_type",
             "instance.task_id.assignment_group"], aggfunc="metric.question")

print(df2)
df2.to_csv(r"C:\Users\vion1\Ele\Engie\Import_Engie\resultat.csv")
SRP
  • 209
  • 6
  • 15
  • 1
    Please include any relevant information [as text directly into your question](https://stackoverflow.com/editing-help), do not link or embed external images of source code or data. Images make it difficult to efficiently assist you as they cannot be copied and offer poor usability as they cannot be searched. See: [Why not upload images of code/errors when asking a question?](https://meta.stackoverflow.com/q/285551/15497888) – Henry Ecker Oct 02 '21 at 22:43
  • 1
    If you need assistance formatting a small sample of your DataFrame as a copyable piece of code for SO see [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/15497888). – Henry Ecker Oct 02 '21 at 22:43
  • What is your expect output for `pivot`. You have `index='instance.task_id.number'` and `columns='metric.question'` but `values=?` – Corralien Oct 02 '21 at 22:54
  • I edit the question as asked. As values i want the rows of metric.question. The csvfile is larger, i took only two columns to simplify the problem but after i have to add every columns, my first step is to have one "instance.task_id.number – SRP Oct 02 '21 at 23:07

1 Answers1

1

Can you try this:

>>> df.assign(cols=df.groupby('instance.task_id.number').cumcount()) \
      .pivot(index='instance.task_id.number',
             columns='cols',
             values='metric.question') \
      .rename_axis(index=None, columns=None)

             0   1   2   3
REQ0510079  Q1  Q2  Q3  Q4
REQ0527568  Q1  Q2  Q3  Q4

Old answer

Following my comment:

data = {'instance.task_id.number': ['REQ0510079','REQ0510079','REQ0510079','REQ0510079',
                                    'REQ0527568','REQ0527568','REQ0527568','REQ0527568'],
        'metric.question': ['Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4']}
df = pd.DataFrame(data)

Using pivot:

>>> df.pivot(index='instance.task_id.number',
             columns='metric.question',
             values='metric.question')

metric.question          Q1  Q2  Q3  Q4
instance.task_id.number                
REQ0510079               Q1  Q2  Q3  Q4
REQ0527568               Q1  Q2  Q3  Q4

Corralien
  • 109,409
  • 8
  • 28
  • 52
  • I got the error : "ValueError: Index contains duplicate entries, cannot reshape" – SRP Oct 02 '21 at 23:07
  • It works for one column, thanks ! But then when i want to drop the filter i used and add every column on my csvfile it duplicate every columns the number of "instance.task_id.number". It does the job for every column i guess. The filter i used is : df = df.filter(["instance.task_id.number", "metric.question"], axis = 1) I still want only one instance.task_id.number, spread the metric question, but keep others values "normal" – SRP Oct 02 '21 at 23:36
  • You have to use `pivot_table` but you need to define an `aggfunc` to deal with duplicate values (first, last, mean, sum, ...). Try to setup an example with errors please. – Corralien Oct 02 '21 at 23:40
  • I tried something, using tutoriel "Pivoting with aggregating" and pandas documentation – SRP Oct 03 '21 at 00:11
  • Aha yes, why ? I think there is no private information here – SRP Oct 03 '21 at 00:19
  • I don't know how to join the created chat – SRP Oct 03 '21 at 00:32