0

I have an Excel file (.xlsx) with content similar to the 'xlsx' dataframe below, and a csv-file ('csv' below).

Over the last week (or so) a fair amount of web-searching leads me to believe that python & pandas might be a good way to add the "os" info from the csv to the xlsx.

Unfortunately I've never done any python coding and, by extension, neither have I any experience with pandas :)

Nevertheless, inspired by many, many examples (the latest being https://stackoverflow.com/a/46550754) I've tried my luck.

Unfortunately my luck seems to have deserted me in this case.

I think that one of the issues I'm facing is that there's no direct correlation between the 2 files. One has (uppercased argh) fqdns and the other (nicely lowercased) bare hostnames.

Also, the xlsx data has a few lines of "pre-amble" in the first columns followed by some headings on row 10.

So, I've tried with various incantations involving .lower() and startswith() and also tried to wrap my mind around map() and lambdas, but, being rather unfamiliar wiht the vagaries of python (and pandas) I keep garnering all sorts of less-than-ideal exceptions that I'm also struggling with. I may be taking on a bit much, what with trying to get a grasp on python, pandas and jq (for a different issue) at the same time?

Out of shame, I'll refrain from exposing my amateurish python-hacking.

What I have:

xlsx = pd.DataFrame([
    ["FQDN1", "4.3.2.1", "finfo1", "FINFO1"],
    ["FQDN2", "4.3.2.2", "finfo2", "FINFO2"],
    ["FQDN3", "4.3.2.3", "finfo3", "FINFO3"],
    ["FQDN4", "4.3.2.4", "finfo4", "FINFO4"],
])
csv  = pd.DataFrame([
    ["host1", "hw1", "OS1"],
    ["host2", "hw2", "OS2"],
    ["host3", "hw3", "OS3"],
],
    columns=[ "Host", "hwinfo", "os"]
)

What I'm attempting to achieve:

new = pd.DataFrame([
    ["FQDN1", "4.3.2.1", "finfo1", "FINFO1", "OS1"],
    ["FQDN2", "4.3.2.2", "finfo2", "FINFO2", "OS2"],
    ["FQDN3", "4.3.2.3", "finfo3", "FINFO3", "OS3"],
    ["FQDN4", "4.3.2.4", "finfo4", "FINFO4", ""], 
],
    columns=["Host", "IP", "info1", "info2", "OS"]
)

Edit: The numerals in the various fields are merely an exhibition of my lack of imagination. The host names (and FQDNs) are (more or less) all over the place.

Also fixed the missing column in the new DF.

The join(?) condition would be that the xlsx FQDN starts with the csv hostname (case insensitive).

As an exhibit here's a couple of my (probably rather naive) attempts:

xlsx.loc[
         ( xlsx['Unnamed: 1'].str.lower().str.startswith(csv['Host']) | xlsx['Unnamed: 2'].str.lower().str.startswith(csv['Host'] )
         ) ] = csv.loc[
         ( xlsx['Unnamed: 1'].str.lower().str.startswith(csv['Host']) | xlsx['Unnamed: 2'].str.lower().str.startswith(csv['Host'] )
         ), csv['os']].values

and

xlsx['Unnamed: 5'] = np.where( (xlsx['Unnamed: 1'].str.lower().str.startswith(csv['Host']) | xlsx['Unnamed: 2'].str.lower().str.startswith(csv['Host'])), csv['os'],"")

(strictly speaking, the "magic" should leave non-matching rows untouched - or return the original content)

An additional plot twist is that the csv "Host" column occasionally has an IP# instead of a host name - hance my attempts to check for both.

Stygge
  • 1
  • 2

1 Answers1

0

It looks like you want to create a key using the digit from both Host values. You can use the .str method to slice off the last character -1 and use that as a key for joining the os from the second df.

import pandas as pd
xlsx = pd.DataFrame([
    ["FQDN1", "4.3.2.1", "finfo1", "FINFO1"],
    ["FQDN2", "4.3.2.2", "finfo2", "FINFO2"],
    ["FQDN3", "4.3.2.3", "finfo3", "FINFO3"],
    ["FQDN4", "4.3.2.4", "finfo4", "FINFO4"]],
    columns=["Host", "IP", "info1", "info2"]
)

csv  = pd.DataFrame([
    ["host1", "hw1", "OS1"],
    ["host2", "hw2", "OS2"],
    ["host3", "hw3", "OS3"],
],
    columns=[ "Host", "hwinfo", "os"]
)

new = (
    xlsx.assign(key=xlsx['Host'].str[-1])
        .merge(csv.assign(key=csv['Host'].str[-1])[['key','os']],
               on='key',
               how='left')
)
print(new)

Output

    Host       IP   info1   info2 key   os
0  FQDN1  4.3.2.1  finfo1  FINFO1   1  OS1
1  FQDN2  4.3.2.2  finfo2  FINFO2   2  OS2
2  FQDN3  4.3.2.3  finfo3  FINFO3   3  OS3
3  FQDN4  4.3.2.4  finfo4  FINFO4   4  NaN
Chris
  • 15,819
  • 3
  • 24
  • 37
  • My apologies. I've now edited my post to reflect that the host names aren't actually all that uniform. – Stygge Oct 11 '22 at 17:14