Pandas read csv only returns the first column when column names are duplicate

Question

I have OHLC data in a .csv file where the stock name is repeated across the first header row, like this:

M6A=F, M6A=F,M6A=F, M6A=F, M6A=F
Open, High, Low, Close, Volume

I am using pandas read_csv to read it and pass all (and only) the 'M6A=F' columns to FastAPI. So far nothing I do gets all the columns: I get either the first column if I filter with "usecols=" or the last column if I filter with "names=".

I don't want to load the entire .csv file and then drop the unwanted data, because speed matters here, so I need to filter while reading rather than afterwards.

Here is my code example:

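import json

import pandas as pd
from fastapi import FastAPI

app = FastAPI()

symbol = ['M6A=F']
# Keep only the columns whose header name is in `symbol`.
df = pd.read_csv('myOHCLVdata.csv', skipinitialspace=True, usecols=lambda x: x in symbol)

def parse_csv(df):
    # Serialize the frame to a JSON string, then load it back into Python objects.
    res = df.to_json(orient="records")
    parsed = json.loads(res)
    return parsed

@app.get("/test")
def historic():
    return parse_csv(df)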

What I have done so far: I checked the documentation for pandas.read_csv and it says "names=" will not allow duplicates. I use a lambda in the above code so that a symbol that does not match any column does not hang FastAPI. My understanding from other Stack Overflow questions is that mangle_dupe_cols=True should rename the duplicates to M6A=F.1, M6A=F.2, M6A=F.3 and so on when pandas reads the file into a dataframe, but I don't see that happening. I tried setting it to False, but pandas says that is not implemented yet. Other answers I found on Stack Overflow don't seem to match what is happening in my code, since I only get the first column returned, or the last column with the others overwritten. (I included the FastAPI code here as it might be related to the issue or to a workaround.)
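
For reference, a minimal sketch of what seems to be going on, and of the two-step workaround suggested in the comments. The inline CSV text, the extra `ES=F` column and the names `CSV_TEXT`, `mangled` and `wanted` are made up for illustration. The sketch assumes pandas renames duplicate header names to `M6A=F`, `M6A=F.1`, ... before `usecols` is applied, which would explain why a filter on the bare symbol keeps only the first column.

```python
import io

import pandas as pd

# Hypothetical stand-in for the file in the question: the symbol is repeated
# in the first header row and the field names sit in the second row.
CSV_TEXT = """M6A=F, M6A=F,M6A=F, M6A=F, M6A=F,ES=F
Open, High, Low, Close, Volume,Open
0.70, 0.71, 0.69, 0.70, 1200,4100.0
0.70, 0.72, 0.70, 0.71, 1500,4105.5
"""

symbol = "M6A=F"

# Step 1: read only the header (nrows=0) to see the names pandas assigns.
header = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True, nrows=0)
mangled = list(header.columns)

# Step 2: keep the bare symbol plus its numbered duplicates (M6A=F.1, ...).
wanted = [c for c in mangled if c == symbol or c.startswith(symbol + ".")]

# Step 3: read only those columns; the other symbols are never loaded.
# As in the question's code, the Open/High/... row arrives as the first data row.
df = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True, usecols=wanted)

print(mangled)
print(list(df.columns))
```

The prefix test in step 2 would also match a different symbol that happens to start with `M6A=F.`, so a stricter check may be needed for real data.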

Possible duplicate of https://stackoverflow.com/questions/59935835/read-duplicate-column-names-in-csv-file. This has the solution you are looking for - the trick is to read the first row, then provide the column names as `M6A=F.1, M6A=F.2,...` to `usecols`. — viggnah, Aug 11 '22 at 04:03
this worked and in the absence of something simpler I will use it. I had seen that, but had thought it was a bodge workaround as it was over 2 years old. — mdkb, Aug 11 '22 at 04:18
For returning the DataFrame in JSON format, please don't use your current approach. Instead, please have a look at Option 1 (**Updates** 1 or 2) of [this answer](https://stackoverflow.com/a/71205127/17865804). — Chris, Aug 11 '22 at 06:43
@Chris a copy of that exact doco elsewhere is where I got the original approach from. Can you explain why the approach I chose is a bad one? I will test out the one you suggested though, thanks. — mdkb, Aug 11 '22 at 07:35
@Chris. I tested update 1 and got `ValueError: Out of range float values are not JSON compliant`, but update 2 using the `Response` method works okay. I'd still like to know why it matters to use that rather than what I was using. — mdkb, Aug 11 '22 at 07:41
The reason is simply because, using your approach, you are first converting the DataFrame into JSON (using `df.to_json()`), then that JSON into `dict` (using `json.loads()`), and finally, when you return the `dict` from the endpoint, FastAPI, behind the scenes, automatically converts that return value into JSON again, as explained in [this answer](https://stackoverflow.com/a/71205127/17865804). — Chris, Aug 11 '22 at 08:00
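
The double conversion described in the comment above can be sketched without FastAPI. The two-row frame is hypothetical, and the last part is only an assumption about one way the `ValueError` reported earlier could arise: a NaN left in plain Python records is rejected by strict JSON encoding, while `to_json` writes it as `null`.

```python
import json

import pandas as pd

# Hypothetical frame with a missing value, standing in for the OHLCV data.
df = pd.DataFrame({"Open": [0.70, None], "Close": [0.71, 0.72]})

# The question's approach: DataFrame -> JSON string -> Python objects.
# Returning these objects from an endpoint serializes them one more time.
records = json.loads(df.to_json(orient="records"))

# Serializing once: to_json already yields a JSON string, with NaN as null.
body = df.to_json(orient="records")

# Plain dicts keep NaN as a float, which strict JSON encoding rejects.
raw = df.to_dict(orient="records")
try:
    json.dumps(raw, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

print(body)
print(strict_ok)
```

In a FastAPI endpoint, the string in `body` could be returned as is with `Response(content=body, media_type="application/json")`, which appears to be what the `Response` method mentioned in the comments refers to.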
