4

I have a CSV file which is downloaded from database(as it is in CSV) and now I have to parse into JSON Schema. Don't worry this link just github gist

enter image description here

Problem I am facing is its Multi line Header check CSV File Here

If you take notice in the file:

  1. On 1st line of CSV it has 1st line of headers then next line has all the values for those headers.

  2. On 3rd line of CSV file it has 2nd line of headers then next line has all the values for those headers.

  3. On 5th line of CSV file it has 3rd line of headers then next line has all the values for those headers.

Also you can notice the pattern here,

  • 1st line of headers hasn't any tab
  • 2nd line of headers has only one tab
  • 3rd line of headers has two tabs

This goes for all the records.

Now 1st problem is this multi line of headers. And 2nd problem is how to parse it into nested json as I have. one of the solution I have tried Create nested JSON from CSV. and noticed the 1st problem with my csv.

My look like this. Where I am only trying to parse initial fields of schema

import csv
import json


def csvParse(csvfile):
    # Open the CSV
    f = open(csvfile, 'r')
    # Change each fieldname to the appropriate field name.
    reader = csv.DictReader(f, fieldnames=("Order Ref", "Order 
Status", "Affiliate", "Source", "Agent", "Customer Name", "Customer Name", "Email 
Address", "Telephone", "Mobile", "Address 1", "Address 2", "City", "County/State",
"Postal Code", "Country", "Voucher Code", " Voucher Amount", "Order Date", "Item ID", 
"Type", "Supplier Code", "Supplier Name", "Booking Ref", "Supplier Price", "Currency", "Selling Price", "Currency", "Depart", "Arrive", "Origin", 
"Destination", "Carrier", "Flight No", "Class", "Pax Type", "Title", 
"Firstname", "Surname", "DOB", "Gender", "FOID Type"))

customer = []
data = []
# data frame names in a list
for row in reader:
    frame = {"orderRef": row["Order Ref"],
             "orderStatus": row["Order Status"],
             "affiliate": row["Affiliate"],
             "source": row["Source"],
             "customers": []}

    data.append(frame)
Zayn Korai
  • 493
  • 1
  • 6
  • 24
  • 2
    Please include in the question itself enough sample data to let us see what you are asking without clicking through to random third-party sites. Also, what does your attempt so far look like? – tripleee Feb 08 '19 at 09:45
  • please check. more information added – Zayn Korai Feb 09 '19 at 11:18
  • What have you tried? Most likely you would will need to keep track of the previous line, the current dept and the header for each dept. The first line is the header for dept zero, if any line is identical to the `header[n]` you make dept=n and skip the line. For each depth, for each line where the `column[dept]` is empty and the `previous_line[dept]` is not you increment dept and store this line as `header[dept]`. Enough to get started, show us some code and we help you from there. – Paulo Scardine Feb 10 '19 at 19:42
  • Check change code added Sorry but I am unable to even get start with it. There is typo in bounty description "I can't* start writing appropriate code" – Zayn Korai Feb 10 '19 at 19:59
  • Your data is all kinds of wrong, what created this? a program made by you or is this how you're getting it? If you made this fix that first because this it not how you properly use a csv file. – Jab Feb 10 '19 at 20:23
  • The easiest solution would be to fix the data itself if possible (which I'm guessing it isn't). The data is absolutely hideous and ugly! – Diptangsu Goswami Feb 10 '19 at 20:46
  • @Jaba Yup its all wrong with data and this is how I am getting it. Also I can't do anything about that, because we were using some third party software and that's how data is being downloaded by system itself and now trying to migrate it. – Zayn Korai Feb 10 '19 at 21:24
  • [ThinkData Works](https://thinkdataworks.com/) has a product specifically meant for parsing messy files like this. – manesioz Jun 08 '20 at 19:25

2 Answers2

1

This can be parsed with some recursion. The code below uses the blank records at the beginning of the line to determine that a sub-record needs to be parsed.

This code does not map the data into the desired schema as that should be done in a second step. Would suggest the ETL pattern, where this code would be part of the Extract step. The Transform step should be a separate routine which maps each of these records into the desired schema.

Code:

def get_records_from_csv(csv_filename):

    with open(csv_filename, 'r') as f:

        csv_file = iter(csv.reader(f))
        last_line = []

        def rows():
            """wrapper to csv iterator which allows a read of last_line"""
            while True:
                if last_line:
                    yield last_line.pop()
                else:
                    yield next(csv_file)

        def row_level(row):
            """how many empty fields on start of row"""
            for i, col in enumerate(row):
                if col.strip():
                    return i

        def get_records(level=0):
            header = next(rows())
            records = []
            for row in rows():
                this_level = row_level(row)
                if this_level == level:
                    records.append(dict(zip(header[level:], row[level:])))
                elif this_level < level:
                    break
                elif this_level == level + 1:
                    last_line.append(row)
                    record_type, sub_records = get_records(this_level)
                    if record_type:
                        records[-1].setdefault(
                            record_type, []).extend(sub_records)
                    if not level:
                        last_line.append(header)
                    break
                else:
                    sub_record = [
                        records[-1][k] for k in header[level:this_level]]
                    this_row = sub_record + row[this_level:]
                    records.append(dict(zip(header[level:], this_row)))

            return header[level], records

        record = get_records(0)[1]
        while record:
            yield record
            record = get_records(0)[1]

Test Code:

import csv
import json
print(json.dumps(list(get_records_from_csv('csvfile.csv')), indent=2))

Results:

[
  [
    {
      "Order Ref": "1234",
      "Order Status": "PayOfflineConfirmedManual",
      "Affiliate": " ",
      "Source": " ",
      "Agent": "akjsd@ad.com",
      "Customer Name": "Mr Kas Iasd",
      "Email Address": "asd@asd.com",
      "Telephone": "3342926655",
      "Mobile": " ",
      "Address 1": " ",
      "Address 2": " ",
      "City": " ",
      "County/State": " ",
      "Postal Code": " ",
      "Country": " ",
      "Voucher Code": " ",
      "Voucher Amount": "0",
      "Order Date": "11/01/2018 18:51",
      "Item ID": [
        {
          "Item ID": "125",
          "Type": "Flight",
          "Supplier Code": "SB",
          "Supplier Name": "Sabre",
          "Booking Ref": "ABC123",
          "Supplier Price": "5002",
          "Currency": "PKR",
          "Selling Price": "5002",
          "Depart": "12/01/2018 13:15",
          "Arrive": "ONEWAY",
          "Origin": "KHI",
          "Destination": "LHE",
          "Carrier": "PK",
          "Flight No": "354",
          "Class": "Economy",
          "": "",
          "Pax Type": [
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Aasdsa",
              "Surname": "Mas",
              "DOB": "19/09/1995",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            },
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Asdad",
              "Surname": "Dasd",
              "DOB": "07/12/1995",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            }
          ]
        }
      ]
    }
  ],
  [
    {
      "Order Ref": "1235",
      "Order Status": "PayOfflinePendingManualProcessing",
      "Affiliate": " ",
      "Source": " ",
      "Agent": "asdss@asda.com",
      "Customer Name": "Mr Asdsd Asdsd",
      "Email Address": "ads@ads.com",
      "Telephone": "3332324252",
      "Mobile": "3332784342",
      "Address 1": " ",
      "Address 2": " ",
      "City": " ",
      "County/State": " ",
      "Postal Code": " ",
      "Country": " ",
      "Voucher Code": "ABC123",
      "Voucher Amount": "100",
      "Order Date": "11/01/2018 17:06",
      "Item ID": [
        {
          "Item ID": "123",
          "Type": "Flight",
          "Supplier Code": "PITCH",
          "Supplier Name": "Kicker",
          "Booking Ref": "FAILED",
          "Supplier Price": "154340",
          "Currency": "PKR",
          "Selling Price": "154340",
          "Depart": "18/01/2018 11:40",
          "Arrive": "18/01/2018 14:25",
          "Origin": "KHI",
          "Destination": "DXB",
          "Carrier": "PA",
          "Flight No": "210",
          "Class": "Economy",
          "": ""
        },
        {
          "Item ID": "123",
          "Type": "Flight",
          "Supplier Code": "PITCH",
          "Supplier Name": "Kicker",
          "Booking Ref": "FAILED",
          "Supplier Price": "154340",
          "Currency": "PKR",
          "Selling Price": "154340",
          "Depart": "25/01/2018 6:25",
          "Arrive": "25/01/2018 10:40",
          "Origin": "DXB",
          "Destination": "LHE",
          "Carrier": "PA",
          "Flight No": "211",
          "Class": "Economy",
          "": "",
          "Pax Type": [
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Asd",
              "Surname": "Azam",
              "DOB": "11/08/1991",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            },
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Aziz",
              "Surname": "Asdsd",
              "DOB": "01/07/1974",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            },
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "mureed",
              "Surname": "ahmed",
              "DOB": "28/05/1995",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            },
            {
              "Pax Type": "Child",
              "Title": "Mr",
              "Firstname": "abdullah",
              "Surname": "Cdsd",
              "DOB": "14/04/2012",
              "Gender": "Female",
              "FOID Type": "None",
              "": ""
            },
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Asdsd",
              "Surname": "Ahmed",
              "DOB": "17/12/1999",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            }
          ]
        }
      ]
    }
  ],
  [
    {
      "Order Ref": "1236",
      "Order Status": "PayOfflinePendingManualProcessing",
      "Affiliate": " ",
      "Source": " ",
      "Agent": "asda@asdad.com",
      "Customer Name": "Mr Asds Sdsd",
      "Email Address": "asd@asdsd.com",
      "Telephone": "3067869234",
      "Mobile": "3067869234",
      "Address 1": " ",
      "Address 2": " ",
      "City": " ",
      "County/State": " ",
      "Postal Code": " ",
      "Country": " ",
      "Voucher Code": " ",
      "Voucher Amount": "0",
      "Order Date": "11/01/2018 16:23",
      "Item ID": [
        {
          "Item ID": "124",
          "Type": "Flight",
          "Supplier Code": "PITCH",
          "Supplier Name": "Kicker",
          "Booking Ref": " ",
          "Supplier Price": "20134",
          "Currency": "PKR",
          "Selling Price": "20134",
          "Depart": "23/01/2018 2:00",
          "Arrive": "ONEWAY",
          "Origin": "KHI",
          "Destination": "SHJ",
          "Carrier": "PK",
          "Flight No": "812",
          "Class": "Economy",
          "": "",
          "Pax Type": [
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Asds",
              "Surname": "raza",
              "DOB": "01/12/1994",
              "Gender": "Male",
              "FOID Type": "Passport",
              "": ""
            }
          ]
        }
      ]
    }
  ]
]

csvfile.csv

Order Ref,Order Status,Affiliate,Source,Agent,Customer Name,Email Address,Telephone,Mobile,Address 1,Address 2,City,County/State,Postal Code,Country,Voucher Code,Voucher Amount,Order Date
1234,PayOfflineConfirmedManual, , ,akjsd@ad.com,Mr Kas Iasd,asd@asd.com,3342926655, , , , , , , , ,0,11/01/2018 18:51
 ,Item ID,Type,Supplier Code,Supplier Name,Booking Ref,Supplier Price,Currency,Selling Price,Currency,Depart,Arrive,Origin,Destination,Carrier,Flight No,Class,
 ,125,Flight,SB,Sabre,ABC123,5002,PKR,5002,PKR,12/01/2018 13:15,ONEWAY,KHI,LHE,PK,354,Economy,
 , ,Pax Type,Title,Firstname,Surname,DOB,Gender,FOID Type,,,,,,,,,
 , ,Adult,Mr,Aasdsa,Mas,19/09/1995,Male,None,,,,,,,,,
 , ,Adult,Mr,Asdad,Dasd,07/12/1995,Male,None,,,,,,,,,
Order Ref,Order Status,Affiliate,Source,Agent,Customer Name,Email Address,Telephone,Mobile,Address 1,Address 2,City,County/State,Postal Code,Country,Voucher Code,Voucher Amount,Order Date
1235,PayOfflinePendingManualProcessing, , ,asdss@asda.com,Mr Asdsd Asdsd,ads@ads.com,3332324252,3332784342, , , , , , ,ABC123,100,11/01/2018 17:06
 ,Item ID,Type,Supplier Code,Supplier Name,Booking Ref,Supplier Price,Currency,Selling Price,Currency,Depart,Arrive,Origin,Destination,Carrier,Flight No,Class,
 ,123,Flight,PITCH,Kicker,FAILED,154340,PKR,154340,PKR,18/01/2018 11:40,18/01/2018 14:25,KHI,DXB,PA,210,Economy,
 , , , , , , , , , ,25/01/2018 6:25,25/01/2018 10:40,DXB,LHE,PA,211,Economy,
 , ,Pax Type,Title,Firstname,Surname,DOB,Gender,FOID Type,,,,,,,,,
 , ,Adult,Mr,Asd,Azam,11/08/1991,Male,None,,,,,,,,,
 , ,Adult,Mr,Aziz,Asdsd,01/07/1974,Male,None,,,,,,,,,
 , ,Adult,Mr,mureed,ahmed,28/05/1995,Male,None,,,,,,,,,
 , ,Child,Mr,abdullah,Cdsd,14/04/2012,Female,None,,,,,,,,,
 , ,Adult,Mr,Asdsd,Ahmed,17/12/1999,Male,None,,,,,,,,,
Order Ref,Order Status,Affiliate,Source,Agent,Customer Name,Email Address,Telephone,Mobile,Address 1,Address 2,City,County/State,Postal Code,Country,Voucher Code,Voucher Amount,Order Date
1236,PayOfflinePendingManualProcessing, , ,asda@asdad.com,Mr Asds Sdsd,asd@asdsd.com,3067869234,3067869234, , , , , , , ,0,11/01/2018 16:23
 ,Item ID,Type,Supplier Code,Supplier Name,Booking Ref,Supplier Price,Currency,Selling Price,Currency,Depart,Arrive,Origin,Destination,Carrier,Flight No,Class,
 ,124,Flight,PITCH,Kicker, ,20134,PKR,20134,PKR,23/01/2018 2:00,ONEWAY,KHI,SHJ,PK,812,Economy,
 , ,Pax Type,Title,Firstname,Surname,DOB,Gender,FOID Type,,,,,,,,,
 , ,Adult,Mr,Asds,raza,01/12/1994,Male,Passport,,,,,,,,,
Stephen Rauch
  • 47,830
  • 31
  • 106
  • 135
  • if someone is trying to run above code add this if you are using python 3.7. `try: yield next(csv_file) except StopIteration: return ` – Zayn Korai Feb 12 '19 at 11:47
1

This is not csv parser friendly, because the file contains multiple csv, and at least one contains 2 columns with same name, which prevents from using a DictReader.

I would first build a wrapper that allows to parse each csv fragment and kindly raises a stop iteration when it reaches a different fragment. I would use the re module to find the different headers.

Here is what the code for the wrapper could be:

class multi_csv:
    levels = [re.compile('Order Ref,Order Status,Affiliate,Source,Agent,'
                       '.*,Country,Voucher Code,Voucher Amount,Order Date'),
              re.compile('\s*,Item ID,Type,Supplier Code,Supplier Name,'
                      '.*,Arrive,Origin,Destination,Carrier,Flight No,Class,'),
              re.compile('\s*,\s*,Pax Type,Title,Firstname,Surname,DOB,Gender,'
                        'FOID Type,*')
              ]
    def __init__(self, fd):
        self.fd = fd
        self.level = 0
        self.end = False
    def __iter__(self):
        return self
    def __next__(self):
        try:
            line = next(self.fd)
        except StopIteration:
            self.end = True
            raise
        for lev, rx in enumerate(self.levels):
            if rx.match(line):
                self.level = lev
                raise StopIteration('New level')
        return line

It can then be used to build a Python object according to your Json schema:

mc = multi_csv(open(csvfile, 'r')
orders = []
while not mc.end:
    rd = csv.reader(mc)
    for row in rd:
        if mc.level == 0:
            items = []
            orders.append({
                'orderRef': int(row[0]),
                'orderStatus': row[1],
                'affiliate': row[2],
                'source': row[3],
                'agent': row[4],
                'customer': {
                    'name': row[5],
                    'email': row[6],
                    'telephone': row[7],
                    'mobile': row[8],
                    'address': {
                        'address1': row[9],
                        'address2': row[10],
                        'city': row[11],
                        'county': row[12],
                        'postCode': row[13],
                        'country': row[14],
                        },
                    },
                'voucherCode': row[15],
                'voucherAmount': int(row[16]),
                'orderDate': row[17],
                'items': items,
                })
        elif mc.level == 1:
            if len(row[1].strip()) != 0:
                legs = []
                passengers = []
                items.append({
                    'itemId': int(row[1]),
                    'type': row[2],
                    'supplierCode': row[3],
                    'supplierName': row[4],
                    'bookingReference': row[5],
                    'supplierPrice': row[6],
                    'supplierPriceCurrency': row[7],
                    'sellingPrice': row[8],
                    'sellingPriceCurrency': row[9],
                    'legs': legs,
                    'passengers': passengers,
                    })
            legs.append({
                'departureDate': row[10],
                'arrivalDate': row[11],
                'origin': row[12],
                'destination': row[13],
                'carrier': row[14],
                'flightNumber': row[15],
                'class': row[16],
                })
        else:    # mc.level == 2
            passengers.append({
                'passengerType': row[2],
                'title': row[3],
                'firstName': row[4],
                'surName': row[5],
                'dob': row[6],
                'gender': row[7],
                'foidType': row[8],
                })

With your file, it gives the expected result:

pprint.pprint(orders)

[{'affiliate': ' ',
  'agent': 'akjsd@ad.com',
  'customer': {'address': {'address1': ' ',
                           'address2': ' ',
                           'city': ' ',
                           'country': ' ',
                           'county': ' ',
                           'postCode': ' '},
               'email': 'asd@asd.com',
               'mobile': ' ',
               'name': 'Mr Kas Iasd',
               'telephone': '3342926655'},
  'items': [{'bookingReference': 'ABC123',
             'itemId': 125,
             'legs': [{'arrivalDate': 'ONEWAY',
                       'carrier': 'PK',
                       'class': 'Economy',
                       'departureDate': '12/01/2018 13:15',
                       'destination': 'LHE',
                       'flightNumber': '354',
                       'origin': 'KHI'}],
             'passengers': [{'dob': '19/09/1995',
                             'firstName': 'Aasdsa',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'Mas',
                             'title': 'Mr'},
                            {'dob': '07/12/1995',
                             'firstName': 'Asdad',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'Dasd',
                             'title': 'Mr'}],
             'sellingPrice': '5002',
             'sellingPriceCurrency': 'PKR',
             'supplierCode': 'SB',
             'supplierName': 'Sabre',
             'supplierPrice': '5002',
             'supplierPriceCurrency': 'PKR',
             'type': 'Flight'}],
  'orderDate': '11/01/2018 18:51',
  'orderRef': 1234,
  'orderStatus': 'PayOfflineConfirmedManual',
  'source': ' ',
  'voucherAmount': 0,
  'voucherCode': ' '},
 {'affiliate': ' ',
  'agent': 'asdss@asda.com',
  'customer': {'address': {'address1': ' ',
                           'address2': ' ',
                           'city': ' ',
                           'country': ' ',
                           'county': ' ',
                           'postCode': ' '},
               'email': 'ads@ads.com',
               'mobile': '3332784342',
               'name': 'Mr Asdsd Asdsd',
               'telephone': '3332324252'},
  'items': [{'bookingReference': 'FAILED',
             'itemId': 123,
             'legs': [{'arrivalDate': '18/01/2018 14:25',
                       'carrier': 'PA',
                       'class': 'Economy',
                       'departureDate': '18/01/2018 11:40',
                       'destination': 'DXB',
                       'flightNumber': '210',
                       'origin': 'KHI'},
                      {'arrivalDate': '25/01/2018 10:40',
                       'carrier': 'PA',
                       'class': 'Economy',
                       'departureDate': '25/01/2018 6:25',
                       'destination': 'LHE',
                       'flightNumber': '211',
                       'origin': 'DXB'}],
             'passengers': [{'dob': '11/08/1991',
                             'firstName': 'Asd',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'Azam',
                             'title': 'Mr'},
                            {'dob': '01/07/1974',
                             'firstName': 'Aziz',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'Asdsd',
                             'title': 'Mr'},
                            {'dob': '28/05/1995',
                             'firstName': 'mureed',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'ahmed',
                             'title': 'Mr'},
                            {'dob': '14/04/2012',
                             'firstName': 'abdullah',
                             'foidType': 'None',
                             'gender': 'Female',
                             'passengerType': 'Child',
                             'surName': 'Cdsd',
                             'title': 'Mr'},
                            {'dob': '17/12/1999',
                             'firstName': 'Asdsd',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'Ahmed',
                             'title': 'Mr'}],
             'sellingPrice': '154340',
             'sellingPriceCurrency': 'PKR',
             'supplierCode': 'PITCH',
             'supplierName': 'Kicker',
             'supplierPrice': '154340',
             'supplierPriceCurrency': 'PKR',
             'type': 'Flight'}],
  'orderDate': '11/01/2018 17:06',
  'orderRef': 1235,
  'orderStatus': 'PayOfflinePendingManualProcessing',
  'source': ' ',
  'voucherAmount': 100,
  'voucherCode': 'ABC123'},
 {'affiliate': ' ',
  'agent': 'asda@asdad.com',
  'customer': {'address': {'address1': ' ',
                           'address2': ' ',
                           'city': ' ',
                           'country': ' ',
                           'county': ' ',
                           'postCode': ' '},
               'email': 'asd@asdsd.com',
               'mobile': '3067869234',
               'name': 'Mr Asds Sdsd',
               'telephone': '3067869234'},
  'items': [{'bookingReference': ' ',
             'itemId': 124,
             'legs': [{'arrivalDate': 'ONEWAY',
                       'carrier': 'PK',
                       'class': 'Economy',
                       'departureDate': '23/01/2018 2:00',
                       'destination': 'SHJ',
                       'flightNumber': '812',
                       'origin': 'KHI'}],
             'passengers': [{'dob': '01/12/1994',
                             'firstName': 'Asds',
                             'foidType': 'Passport',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'raza',
                             'title': 'Mr'}],
             'sellingPrice': '20134',
             'sellingPriceCurrency': 'PKR',
             'supplierCode': 'PITCH',
             'supplierName': 'Kicker',
             'supplierPrice': '20134',
             'supplierPriceCurrency': 'PKR',
             'type': 'Flight'}],
  'orderDate': '11/01/2018 16:23',
  'orderRef': 1236,
  'orderStatus': 'PayOfflinePendingManualProcessing',
  'source': ' ',
  'voucherAmount': 0,
  'voucherCode': ' '}]
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252