0

I have a text file that looks like this:

      UWI/API             WELLNAME             WELLNO              LABEL                OPER               PRODFM         SURFLAT     SURFLON      BOTLAT      BOTLON   PERFTOP PERFBASE     PERFSOURCE         WELL        WELL        WELL        WELL        WELL        WELL        WELL      SYM   
                                                                                                                                                                                                              ELEV_KB        TD      COMP_DATE   SPUD_DATE     INJLIQ      INJGAS      INJWTR           
49025000510000      CH039068            3-A                 03WC2NE01           STANOLIND O&G CO    603WLCK2               43.384316 -106.275540    0.000000    0.000000  1525.0  1815.0                         4847.00     1815.00  03/20/1925  01/01/1925        0.00   857224.00  6538652.00CO2-INJ 
49025000540000      CH036906A           8-A                 08WC2NE01           STANOLIND O&G CO                           43.383111 -106.278775    0.000000    0.000000     0.0     0.0                                     1842.00  10/07/1926  01/01/1926        0.00   757266.00  2287490.00CO2-INJ 
49025000550000      CH03906B            12-A                12WC2NE01           STANOLIND O&G CO    603WLCK                43.383171 -106.272200    0.000000    0.000000  1448.0  1703.0                         4896.00     1916.00  07/23/1924  01/01/1924        0.00  1157524.00  9007039.00CO2-INJ 
49025000550000                                                                                                                                                            1448.0  1451.0                                                                                                    
49025000550000                                                                                                                                                            1700.0  1703.0                                                                                                    
49025000580000      CH03906B            19-A                19WC2NE01           STANOLIND O&G CO                           43.380695 -106.280337    0.000000    0.000000     0.0     0.0                                     1833.00  08/27/1924  01/01/1924        0.00  3602831.00  8002146.00CO2-INJ 
49025000580000                                                                                                                                                            1565.0  1833.0                                                                                                    
49025000590000      CH03906B            23-AX               23WC2NE01           STANOLIND O&G CO    603WLCK2               43.380810 -106.273800    0.000000    0.000000  1842.0  1916.0                                     1927.00  10/19/1923  01/01/1923        0.00  2673371.00  9392905.00CO2-INJ 
49025000590000                                                                                                                                                               0.0     0.0                                                                

When I run this code, the output puts all of row one into one column. I want the first column to be "UWI/API", the second column to be "WELLNAME", ect... As you can see, the data is not separated by a comma, just by spaces. The data is sort of messed up? It is not aligned. The headers should be "UWI/API" to "INJWTR."

Here is my code:

import pandas as pd

df = pd.read_csv(r'C:/Users/mmcintyre/Documents/Teresa CO2 Injectors_Spaces.txt',sep=r'\s{2,}')

df.to_csv(r'C:/Users/mmcintyre/Documents/Teresa.csv',index=None)

And the output is this:https://sru365edu-my.sharepoint.com/:x:/g/personal/mam1064_sru_edu/Ebq0tpHveAJBu710UcDt82IB23gmRs6nNrTE-MtDUgkU6Q?e=byhjuL

I highlighted the rows that aren't in the right position. I'm not sure if this even fixable.

Thanks! Morgan

Paul H
  • 65,268
  • 20
  • 159
  • 136
  • 1
    try `sep='\s+'` – Paul H Jul 10 '20 at 00:47
  • 1
    also looks like you're not specifying your headers fully in the `read_csv` function. see here for more info https://stackoverflow.com/questions/21318865/read-multi-index-on-the-columns-from-csv-file – Paul H Jul 10 '20 at 00:50
  • Thanks Paul! How did you format that text file exactly?... I just need it to all look like that lol! I replaced the sep with what you suggested. As for the headers, where should I put it in my code? I put it before and after sep='\s+', but I keep getting errors for some reason. Sorry, I'm new! Thanks so much for your help! – Morgan McIntyre Jul 10 '20 at 01:13
  • For some reason, the perftop and perfbase column values end up in wellname and wellno for the API numbers that don't contain data for those cells. – Morgan McIntyre Jul 10 '20 at 01:25

1 Answers1

0

This data seems majorly difficult to parse. I tried a number of settings with a bunch of csv parsers, but I don't think there is a clean way to parse data this messy. I went for the dirty approach. If you only need it for this file, the following should work, if you need to do this regularly, I make no promises this will always work cleanly.

#! /usr/bin/env python

import csv

data = []

with open('Teresa CO2 Injectors_Spaces.TXT') as file_handler:

    header = file_handler.readline().strip().split()
    header2 = ['' for i in range(13)] + file_handler.readline().strip().split() + ['']

    data.append(header)
    data.append(header2)

    for row in file_handler.readlines():

        tempRow = []

        UWIAPI = row[0:20]
        WELLNAME = row[20:40]
        WELLNO = row[40:60]
        LABEL = row[60:80]
        OPER = row[80:100]
        PRODFM = row[100:120]
        SURFLAT = row[120:132]
        SURFLON = row[132:144]
        BOTLAT = row[144:156]
        BOTLON = row[156:168]
        PERFTOP = row[168:176]
        PERFBASE = row[176:184]
        PERFSOURCE = row[184:200]
        ELEV_KB = row[200:220]
        TD = row[218:228]
        COMP_DATE = row[228:240]
        SPUD_DATE = row[240:252]
        INJLIQ = row[252:264]
        INJGAS = row[264:276]
        INJWTR = row[276::]
        SYM = ''

        tempRow = [
            UWIAPI.strip(),
            WELLNAME.strip(),
            WELLNO.strip(),
            LABEL.strip(),
            OPER.strip(),
            PRODFM.strip(),
            SURFLAT.strip(),
            SURFLON.strip(),
            BOTLAT.strip(),
            BOTLON.strip(),
            PERFTOP.strip(),
            PERFBASE.strip(),
            PERFSOURCE.strip(),
            ELEV_KB.strip(),
            TD.strip(),
            COMP_DATE.strip(),
            SPUD_DATE.strip(),
            INJLIQ.strip(),
            INJGAS.strip(),
            INJWTR.strip(),
            SYM,
        ]

        data.append(tempRow)

with open('example.csv', 'w+') as fh:

    writer = csv.writer(fh, delimiter=',')

    for row in data:

        writer.writerow(row)

  • Hi @joshua.softaware.dev, how did you determine the range for the row function? I'm trying to add more columns, and I don't understand how this works. Thanks! :) – Morgan McIntyre Jul 10 '20 at 13:21
  • The input file you used seems to have data that is rightpadded with spaces past a certain point, EX the first column is alotted 20 characters, if the data is 14 characters long, they pad it with 6 spaces. The amount appears to be different per column, so I just manually counted the spacing of each column, selected a range in each row of the file (so for WELLNO for example, I did a str slice from 40:60 or char 41-60 because the start is zero index and the end is 1 indexed). Then trim with str.strip(). Slicing example here https://stackoverflow.com/questions/509211/understanding-slice-notation – joshua.software.dev Jul 10 '20 at 18:52