Filter a Pandas DataFrame by a Partial String or Pattern in 8 Ways

How to audit a DataFrame and return simply the desired rows

Filtering a DataFrame refers to checking its contents and returning only those that fit certain criteria. It is part of the data analysis task known asdata wrangling and is efficiently done using thePandas library of Python.

The idea is that once y'all have filtered this information, you lot can analyze it separately and gain insights that might be unique to this group, and inform the predictive modeling steps of the projection moving forrad.

In this article, we will use functions such equallySerial.isin()  andSeries.str.contains()  to filter the information. I minimized the utilise ofapply() andLambda functions which use more code and are disruptive to many people including myself. However, I volition explicate the lawmaking and include links to related articles.

We will use theNetflix dataset from Kaggle which contains details of TV shows and movies including thetitle,director, thecast, agerating,yr of release, andduration. Let us at present import the required libraries and load the dataset into our Jupyter notebook.

The netflix DataFrame by author

1. Filter rows that friction match a given String in a column

Hither, we want to filter by the contents of a particular column. Nosotros will use theSeries.isin([list_of_values] ) function from Pandas which returns a 'mask' ofTruefor every element in the column thatexactly matches orFalse if it does not match any of the list values in theisin() function. Note that you must ever include the value(s) in square brackets even if it is but i.

Image past author

Nosotros and so utilize this mask to the whole DataFrame to return the rows where the condition was True. Note how the index positions where the mask was Truthful above are the only rows returned in the filtered DataFrame below.

Note: df.loc[mask] generates the same results as df[mask]. This is peculiarly useful when you want to select a few columns to display.

Other ways to generate the mask in a higher place;

  • If you do not desire to deal with a mix of upper and lowercase messages in the isin() part, start catechumen all the column'southward elements into lowercase.

  • Nosotros can as well employ the == equality operator which compares if two objects are the aforementioned. This will compare whether each element in a column is equal to the string provided.

  • We tin can provide a list of strings similar isin(['str1','str2']) and bank check if a column's elements friction match any of them. The two masks beneath return the same results.

The mask returned volition exist all Trues because the 'type' column contains only 'Film' and 'TV Show' categories.

two. Filter rows where a partial cord is present

Here, we desire to check if a sub-string is present in a cavalcade.

For case, the'listed-in' column contains the genres that each movie or prove belongs to, separated by commas. I want to filter and return just those that have a 'horror' element in them because right now Halloween is upon us.

We will utilize the string methodSeries.str.contains('pattern', example=False, na=False) where'pattern' is the substring to search for, andexample=False implies case insensitivity.na=False means that anyNaN values in the column will be returned as False (meaning without the pattern) instead of asNaN which removes the boolean identity from the mask.

We will and so apply the mask to our data and display 3 sample rows of the filtered dataframe.

Other examples:

  • We can as well check for the presence of symbols in a column. For example, the 'cast' column contains the actors separated by commas, and we tin bank check for rows where there is merely one actor. This is by checking for rows with a comma (,) and and so applying the filtering mask to the data using a tilde (~) to negate the statement.

But now these results take NaNs because nosotros used na=False and the tilde returns all rows where the mask was False. We will use df.dropna(centrality=0, subset='cast)to the filtered information. We use axis=0 to hateful row-wise because we want to drop the row not the column. subset=['cast'] checks just this column for NaNs.

Note: To check for special characters such as + or ^, use regex=False (the default is True) so that all characters are interpreted as normal strings not regex patterns. You can alternatively use the backslash escape character.

3. Filter rows with either of 2 partial strings (OR)

You can check for the presence of any iior more than strings and renderTruthful if any of the strings are present. Let us check for either 'horrors' or 'stand up-up comedies' to complement our emotional states after each watch. We usestr.contains() and pass the two strings separated by a vertical bar (|) which ways 'or'.

We can also use the long-grade where we create ii masks and pass them into our data using |.

Note: You lot can create many masks and pass them into the information using the symbols | or & . The & ways combine the masks and return True where both masks are True, while | means return True where any of the masks is True.

iv. Filter rows where both strings are present (AND)

Sometimes, we want to check if multiple sub-strings appear in the elements of a column.

In this example, we will search for movies that were filmed in both United states of america and Mexico countries.

  • The code str.contains('str1.*str2') uses the symbols .* to search if both strings appear strictly in that club, where str1 must appear first to return True.

Note how 'United states of america' always appears first in the filtered rows.

  • Where the society does not affair (United mexican states tin can announced kickoff in a row), use str.contains('str1.*str2|str2.*str1'). The | means 'return rows where str1 appears get-go, or str2 appears starting time'.

See how in the fourth row 'Mexico' appears first.

  • You can also create a mask for each country, and then laissez passer the masks into the information using & symbol. The code below displays the same DataFrame equally to a higher place.

v. Filter rows with numbers in a particular column

We might as well want to check for numbers in a column using the regex pattern'[0–ix]'. The code looks likestr.contains('[0–nine]').

In the next example, we want to check the historic period rating and return those with specific ages after the dash such asBoob tube-14, PG-13, NC-17 and go out outTelevision set-Y7 andGoggle box-Y7-FV. We, therefore, add a dash (-) before the number pattern.

6. Filter rows where a fractional cord is present in multiple columns

Nosotros tin check for rows where a sub-cord is present in two or more than given columns.

  • For example, let us bank check for the presence of 'tv' in iii columns ('rating','listed_in' and 'type') and return rows where it's present in all of them. The easiest fashion is to create iii masks each for a specific column, and filter the data using & symbol significant 'and' (use | symbol to return True if it's in at least 1 column).

See how 'television' is nowadays in all three columns in the filtered data above.

  • Another way is using the slightly complicated apply() and Lambda functions. Read the article Lambda Functions with Practical Examples in Python for clarity on how these two functions piece of work.

The code for the mask above says that for every column in the list cols_to_check, apply str.contains('idiot box') function. It then uses .all(axis=1) to consolidate the three masks into ane mask (or column) by returning True for every row where all the columns are Truthful. (use .whatever() to return True for presence in at-least one column).

The filtered DataFrame is the same equally the one displayed previously.

7. Filter for rows where values in ane cavalcade are present in another column.

Nosotros can check whether the value in one cavalcade is present as a fractional cord in some other column.

Using our Netflix dataset, let us cheque for rows where the'director' also appeared in the'cast' as an actor.

In this example, we will utilisedf.employ(),lambda, and the 'in'  keyword which checks if a certain value is present in a given sequence of strings.

df.use() in a higher place holds a lambda function which says that for every row (x), check if the value in 'director' is present in the 'cast' column and return Truthful or False. I wrapped the columns with str() to convert each value into a String because it raised a TypeError probably because of NaNs. We use axis=1 to mean column-wise, therefore the operation is done for every row and the event will be a column (or series).

Whoops! That'south a lot of NaNs. Allow's drib them using the director'southward column as the subset and display afresh.

For other ways to bank check if values in one cavalcade match those in some other, readthis article.

viii. Checking column names (or index values) for a given sub-cord

We can check for the presence of a partial cord in column headers and return those columns.

Filter column names

  • In the example below, we will apply df.filter(like=pattern, centrality=ane) to return column names with the given pattern. We can as well use axis=columns. Note that this returns the filtered data and no mask is generated.

  • Nosotros can too use df.loc where we display all the rows but only the columns with the given sub-cord.

This code generates the same results similar the image above. Readthis article for how.loc works.

Filter by index values

Let u.s. first gear up the title as the index, then filter past the word 'Love'. We will use the same methods as to a higher place with slight adjustments.

Usedf.filter(like='Honey', axis=0) . We can also utiliseaxis=index.

Use df.loc[] to display the same results as above. Here we choose the desired rows in the offset part of .loc and return all columns in the second office.

Other filtering methods

  • Using the pandas query() function

This is a data filtering method especially favored past SQL ninjas. The syntax isdf.query('expression') and the outcome is a modified DataFrame. The absurd thing is that aside from filtering by private columns every bit we've done earlier, you can reference a local variable and call methods such as mean() within the expression. It too offers functioning advantages to other complex masks.

I rarely utilize this approach myself, but many people find information technology simpler and more readable. I encourage you to explore more as it has compelling advantages.This andthis articles are a good place to beginning.

  • Using other string (Series.str.) case functions

These methods can also filter data to return a mask

Serial.str.len() > 10

  • Series.str.startswith('Nov')

  • Serial.str.endswith('2019')

  • Series.str.isnumeric()

  • Series.str.isupper()

  • Series.str.islower()

Determination

Equally a data professional, chances are you volition often demand to separate data based on its contents. In this article, we looked at eight means to filter a DataFrame by the string values present in the columns. We used Pandas,Lambda functions, and the'in' keyword. We also used the | and & symbols, and the tilde (~) to negate a statement.

We learned that these functions return a mask (a column) ofTrue andFalse values. Nosotros then pass this mask into our DataFrame using square brackets similardf[mask] or using the .loc function similardf.loc[mask]. Y'all can download the full lawmakinghere from Github.

I hope yous enjoyed the article. To receive more than like these whenever I publish a new one, subscribehere. If y'all are not yet a medium member and would like to support me as a writer, followthis link and I will earn a modest committee. Thank you for reading!

References

10 Ways to Filter Pandas DataFrame

Python Pandas String Operations— Working with Text Information

Select past partial string from a pandas DataFrame