In our increasingly interconnected world, fraud has become a pervasive force. Criminals use ever more sophisticated techniques to steal data and money, causing billions of dollars in losses worldwide. Fortunately, technology can be just as effective at identifying these attempts at fraud.
In this article, we will focus on credit card transactions specifically. We will demonstrate how to use Enso to bring in multiple data sources to flag transactions identified by a variety of systems as fraudulent, from stolen credit card details to compromised payment terminals.
To start things off, we’re going to read in our dataset of credit card
transactions. Enso automatically infers data types for you, even with text
files, but in this dataset a few messy values in several fields don’t quite
conform to the correct types, so we can use the parse function to set the
types appropriately and discard any values that fail to parse.
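The Enso steps above are built visually, but as a rough sketch of the same logic in pandas (the file name and the TX_DATETIME column are our assumptions; the article only names fields like TX_AMOUNT and TRANSACTION_ID):

```python
import pandas as pd

# Read the raw transactions file (file name assumed for illustration).
transactions = pd.read_csv("transactions.csv")

# Coerce messy fields to their intended types. errors="coerce" turns any
# non-conforming value into NaN/NaT so it can be discarded, roughly what
# the parse step does here. TX_DATETIME is an assumed column name.
transactions["TX_AMOUNT"] = pd.to_numeric(transactions["TX_AMOUNT"], errors="coerce")
transactions["TX_DATETIME"] = pd.to_datetime(transactions["TX_DATETIME"], errors="coerce")
transactions = transactions.dropna(subset=["TX_AMOUNT", "TX_DATETIME"])
```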
From here, we bring in our second dataset. We use Enso’s native database
support to query our internal database directly, using the same syntax as
other operations. Here we can select only the three columns we need. We’ll
join the two datasets on a CUSTOMER_ID match, and replace any empty
TX_AMOUNT values with the Amount field found in the database. We can then
remove the Transaction ID and Amount columns, as we’re done using them.
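Continuing the pandas sketch, this step might look roughly as follows (the connection string, table name, and query are placeholders, since the article doesn’t specify them):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection and query: select only the three columns we need.
engine = create_engine("postgresql://user:password@host/warehouse")
db_rows = pd.read_sql(
    'SELECT "CUSTOMER_ID", "Transaction ID", "Amount" FROM customer_transactions',
    engine,
)

# Join on CUSTOMER_ID, fill empty TX_AMOUNT values from the database's
# Amount field, then drop the two columns we no longer need.
transactions = transactions.merge(db_rows, on="CUSTOMER_ID", how="left")
transactions["TX_AMOUNT"] = transactions["TX_AMOUNT"].fillna(transactions["Amount"])
transactions = transactions.drop(columns=["Transaction ID", "Amount"])
```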
We identify duplicate transactions in our dataset by looking for cases where
the same TRANSACTION_ID exists more than once. The aggregate method is
perfect for this. We mark these duplicates as fraudulent and bring them back
into our main dataset.
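In pandas terms, the same duplicate check could be sketched like this:

```python
# Count how often each TRANSACTION_ID appears (the aggregate step), then
# flag every row whose ID occurs more than once as fraudulent.
counts = transactions["TRANSACTION_ID"].value_counts()
duplicate_ids = counts[counts > 1].index
transactions["Fraud"] = transactions["TRANSACTION_ID"].isin(duplicate_ids)
```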
Now we’re going to add a row number so that we can apply our other fraud
cases to the test dataset. While we’re at it, we’ll also select distinct
records by TX_AMOUNT, and truncate the transaction date field to remove the
time, as we don’t need that level of specificity for the date matching that
follows.
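Sketched in pandas (again assuming the date column is called TX_DATETIME):

```python
# Add a row number, de-duplicate on TX_AMOUNT as described above, and
# truncate the datetime to a date by dropping the time component.
transactions["RowNum"] = range(1, len(transactions) + 1)
transactions = transactions.drop_duplicates(subset=["TX_AMOUNT"])
transactions["TX_DATETIME"] = transactions["TX_DATETIME"].dt.normalize()
```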
Next, we’re going to input a list of terminals that have been flagged as
compromised by our system. This dataset includes only the terminal ID and the
date it was flagged. The company’s policy is to flag all transactions from a
compromised terminal for the 28 days after the initial report. To do this, we
add an end_date field using the Date_Add column operation.
Let’s break this method down, since it’s really powerful. Here we can select
a column operation. We get a dropdown of all of the different calculations we
can perform, and once we select one, like Date_Add, we get a dropdown listing
our columns, so we can easily select the data we want to operate on. We set
our period to 28 days, and then set the new column name to end_date so that
Enso creates a new field.
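The equivalent calculation in the pandas sketch is a one-line date addition (TERMINAL_ID and FLAG_DATE are assumed column names for the flagged-terminals file):

```python
# Load the flagged terminals and derive end_date as the report date plus
# the company's 28-day window.
terminals = pd.read_csv("compromised_terminals.csv", parse_dates=["FLAG_DATE"])
terminals["end_date"] = terminals["FLAG_DATE"] + pd.Timedelta(days=28)
```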
We use Enso’s powerful join method to isolate all the transactions between
two dates, thanks to the handy Between condition. This condition
automatically looks for all dates between our start and end dates for each
specific terminal. Then we can set our Fraud column to True, and select only
the RowNum and Fraud fields, as these are all that we need to update our
main dataset.
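A pandas stand-in for the Between join could look like this (an inner merge on the assumed TERMINAL_ID key, followed by a date-window filter):

```python
# Join transactions to flagged terminals, keep rows whose date falls
# between the flag date and end_date, and mark them as fraudulent.
joined = transactions.merge(terminals, on="TERMINAL_ID", how="inner")
in_window = joined["TX_DATETIME"].between(joined["FLAG_DATE"], joined["end_date"])
flagged = joined.loc[in_window, ["RowNum"]].assign(Fraud=True)
```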
We use the
lookup_and_replace method to update our primary dataset. We can
look up the
RowNum field, and this function will automatically update the
value of our
Fraud column for any row identified as related to a fraudulent
transaction on a compromised terminal.
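In the pandas sketch, that lookup-and-update step reduces to a keyed overwrite:

```python
# Stand-in for lookup_and_replace: look rows up by RowNum and overwrite
# their Fraud value wherever a flagged match exists.
transactions.loc[transactions["RowNum"].isin(flagged["RowNum"]), "Fraud"] = True
```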
Finally, we can bring in a list of flagged customer transactions from
another dataset. This dataset, CustomerFraud.csv, has a lot of extra
information in it, but all we care about is marking the transactions as
fraudulent and using the TRANSACTION_ID to update our core dataset.
So, we set the Fraud column to True for these records, make sure only unique
transactions are selected, and then use lookup_and_replace once more, this
time matching on TRANSACTION_ID, to mark these transactions as fraudulent.
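Rounding out the pandas sketch, this final step mirrors the previous one with a different key:

```python
# Keep the unique TRANSACTION_IDs from CustomerFraud.csv, then apply the
# same lookup-and-replace pattern keyed on TRANSACTION_ID.
customer_fraud = pd.read_csv("CustomerFraud.csv")
fraud_ids = customer_fraud["TRANSACTION_ID"].drop_duplicates()
transactions.loc[transactions["TRANSACTION_ID"].isin(fraud_ids), "Fraud"] = True
```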
We have successfully created a dataset of all of the transactions in our system for this period, with fraudulent transactions correctly identified for downstream analysis, predictive modeling, or reporting use cases.
Fraud is a major threat worldwide, and it can take a variety of shapes, but using Enso we can quickly bring together and analyze data to build better resilience against it.