2

I have a dataframe with no column headers containing strings of information. I would like to select part of the values in the cells based on a substring in each of the cells, and include these elements in a list. The examples I've found listing an answer reference a single (named) column in a dataframe, and can easily apply loc/iloc to get the str attribute of the dataframe

The dataframe looks like below:

0 A[3,5] A[6,7] C[3,9]
1 B[2,9] A[2,1] B[3,7]
2 B[5,6]   

From the dataframe, I would like to extract all coordinates included in brackets which are listed as A in a list (as a string). The resulting list should be:

[3,5, 6,7, 2,1]

I have read data from a csv and included it in a dataframe. I have tried two approached: first, constructing a dataframe with the same dimensions and filling it with 1s where the letter A is present.

Secondly, I've tried to find the location of the occurrence of string "A" in the dataframe, and extracting the part between brackets in one go. In both approaches, I'm trying to use loc/iloc to reference the columns of the dataframe, but get an AttributeError: 'DataFrame' object has no attribute 'str'. Am I thinking in the right way, or is there a more efficient manner to solve my problem?

Update: I have been able to stack the dataframe into a series and am only left with the values where "A" is contained. Now I'm looking to isolate the coordinates based on the substring between brackets. The last line of the code below yields a TypeError: 'Series' objects are mutable, thus they cannot be hashed. How can I retrieve the substring enclosed in brackets?

df = pd.read_csv("FILE.csv", header = None)
df = df.fillna('')

s = df.stack()
s = s[s.str.contains("A")]

s = s[s.str.find("["):s.str.rfind("]")]
user13055597
  • 47
  • 1
  • 6
  • is the dataframe above just a single column? – sammywemmy Apr 07 '20 at 06:13
  • No, stacking did help in retrieving only the requested elements, but now I'm stuck to find the substring contained by the square brackets, as `s = s[s.str.find("["):s.str.rfind("]")]` yields a `TypeError: 'Series' objects are mutable, thus they cannot be hashed`. I will update the question. – user13055597 Apr 07 '20 at 12:31

1 Answers1

5

Error means there is no one column, but more columns. So if all data are in multiple columns use DataFrame.stack in first step and then processing data by Series.str.replace:

df = pd.read_csv("FILE.csv", header = None)

s = df.stack()

L = s[s.str.contains("A")].str.replace('[A\[\]]', '').tolist()
print (L)
['3,5', '6,7', '2,1']

Last if want integers in pairs is possible use list comprehension:

L1 = [[int(y) for y in x.split(',')] for x in L]
print (L1)
[[3, 5], [6, 7], [2, 1]]

Or solution by @Vishnudev with pd.eval what is more safer like eval:

L2 = s[s.str.contains("A")].str.replace('[A]', '').map(pd.eval).explode().tolist()
print (L2)
[3, 5, 6, 7, 2, 1]

Another idea with Series.str.extractall:

L2 = s[s.str.contains("A")].str.extractall('(\d+)')[0].astype(int).tolist()
print (L2)
[3, 5, 6, 7, 2, 1]

EDIT:

In your solution is necessary remove fillna for possible removed missing values by stack:

print (df)
        0       1       2
0  A[3,5]  A[6,7]  C[3,9]
1  B[2,9]  A[2,1]  B[3,7]
2  B[5,6]     NaN     NaN

s = df.stack()
print (s)
0  0    A[3,5]
   1    A[6,7]
   2    C[3,9]
1  0    B[2,9]
   1    A[2,1]
   2    B[3,7]
2  0    B[5,6]
dtype: object

s = s[s.str.contains("A")]
print (s)
0  0    A[3,5]
   1    A[6,7]
1  1    A[2,1]
dtype: object

Last for values between [] is better use Series.str.extract:

s = s.str.extract(r"\[(.*?)\]", expand=False)
print (s)
0  0    3,5
   1    6,7
1  1    2,1
dtype: object

Your solution is possibl with list comprehension:

a = [x[x.find("[")+1:x.rfind("]")] for x in s]
print (a)
['3,5', '6,7', '2,1']
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Maybe `s[s.str.contains("A")].str.replace('A', '').map(eval).explode().tolist()`? I know eval can be hazardous for unkown input. – Vishnudev Krishnadas Apr 07 '20 at 06:15
  • When running `L = s[s.str.contains("A")]` I still get the error `AttributeError: 'list' object has no attribute 'str'` after stacking the dataframe – user13055597 Apr 07 '20 at 09:14
  • @user13055597 - What is `s` ? It is `s = df.stack()` ? – jezrael Apr 07 '20 at 09:22
  • Deleting empty values from the list by running the code below was the reason for the error. I still want to substring the coordinates based on the square brackets and have updated the question accordingly. `s = df.stack() \ s = [x for x in s if x != ''] \ L = s[s.str.contains("A")]` – user13055597 Apr 07 '20 at 12:35
  • @user13055597 - Problem is need remove `df = df.fillna('')`, then `stack` remove missing values – jezrael Apr 07 '20 at 12:49