0

I have a .CSV file that has the following structure (when viewed on Excel), in which it has unique section headers (strings) occupying a single cell, followed by a block with column names and data.

This format repeats itself throughout the Excel spreadsheet. Note that each section has different col names and number, and variable number of rows.

Example:

    Daily Statements                                    

    Date        Desc    Customer ID   Phone          Status
    12/21/21    aaa     1             123-123-1231   OK 
    12/21/21    aaa     2             333-123-1231   OK
    12/21/21    bbb     3             222-123-1231   OK
    12/21/21    bbb     3             444-123-1231   OK
                                              <===== one empty row separates sections
    Account History                                 
    Date        Time    Type    Ref #   Balance
    12/21/21    1:00:00 BAL     456     $0.01 
    12/21/21    1:00:00 BAL     445     $0.01
    12/21/21    1:00:00 BAL     645     $0.01
                                              <===== one empty row separates sections
    Order History                                   
    ID    Date      Ref #
    1     12/21/21  777  
    2     12/21/21  888 
    3     12/21/21  999
    4     12/21/21  9995

My goal is to extract only the rows in Account History:

    Date        Time    Type    Ref #   Balance
    12/21/21    1:00:00 BAL     456     $0.01 
    12/21/21    1:00:00 BAL     445     $0.01
    12/21/21    1:00:00 BAL     645     $0.01

However I'm unable to find an approach that would work in Pandas, as I'd need to use the string "Account History" as an anchor to indicate the rows of interest.

Do you know how this could be achieved?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
pepe
  • 9,799
  • 25
  • 110
  • 188
  • I think it's the same problem as here [is-csv-with-multi-tabs-sheet-possible](https://stackoverflow.com/questions/29615196/is-csv-with-multi-tabs-sheet-possible/29616152) – BaoLocPham Dec 28 '21 at 02:03
  • 1
    How do you construct the data frame? You've shown what the data look like when viewed in excel, but it would be more helpful to see how they're organized in the data frame. – Elijah Cox Dec 28 '21 at 02:41
  • I believe you should try the default way to clean up the data, filling into a new dataframe then starts to do you work. By default I mean with open() ... and use split('\n\n') into different groups. – Cookie Dec 28 '21 at 03:07

1 Answers1

2

I don't see a straightforward way to do that just with Pandas. Why not read the file first just as a text file to find the rows of interest, and only then, use Pandas to import just those rows?

with open(file, 'r') as f:
    # read until the line "Account History"
    for line_n, line_content in enumerate(f):
        if "Account History" in line_content:
            break
    start_row = line_n + 1

    # continue reading, and find the following new line
    for line_n, line_content in enumerate(f):
        if line_content == '\n':
            break
    tab_size = line_n - 1

# import the dataframe, just from the target lines
df = pd.read_csv(file, skiprows=start_row, nrows=tab_size)
Daniel P
  • 140
  • 4
  • this is a beautiful and very elegant solution to my problem, it works perfectly. thanks so much for sharing – pepe Dec 28 '21 at 23:50